36
SFMap: Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs TMA 2015 Tatsuya Mori 1 , Takeru Inoue 2 , Akihiro Shimoda 3 , Kazumichi Sato 3 , Keisuke Ishibashi 3 , and Shigeki Goto 1 1 Waseda University 2 NTT Network Innovation Laboratories 3 NTT Network Technology Laboratories

SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

SFMap Inferring Services over Encrypted Web Flows using

Dynamical Domain Name Graphs TMA 2015

Tatsuya Mori1 Takeru Inoue2 Akihiro Shimoda3

Kazumichi Sato3 Keisuke Ishibashi3 and Shigeki Goto1

1 Waseda University 2 NTT Network Innovation Laboratories

3 NTT Network Technology Laboratories

Background(1) Era of web Change of Internet traffic WIDE Mawi Project httpmawiwideadjp samplepoint B F

1

2002121

2012121

Many of primary Internet services have shifted to Web (http) Everything over HTTP

Era of P2P

Era of Web

Background(2) Encrypting Web

2

D Naylor et al The Cost of the ldquoSrdquo in HTTP Proceedings of ACM CoNext 2014

Deploying HTTPS is not cost any more Significant portion of web traffic is now encrypted

YouTube video over HTTPS

3

Netflix started encrypting stream

HTTP = non-secure

6

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 2: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Background(1) Era of web Change of Internet traffic WIDE Mawi Project httpmawiwideadjp samplepoint B F

1

2002121

2012121

Many of primary Internet services have shifted to Web (http) Everything over HTTP

Era of P2P

Era of Web

Background(2) Encrypting Web

2

D Naylor et al The Cost of the ldquoSrdquo in HTTP Proceedings of ACM CoNext 2014

Deploying HTTPS is not cost any more Significant portion of web traffic is now encrypted

YouTube video over HTTPS

3

Netflix started encrypting stream

HTTP = non-secure

6

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 3: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Background(2) Encrypting Web

2

D Naylor et al The Cost of the ldquoSrdquo in HTTP Proceedings of ACM CoNext 2014

Deploying HTTPS is not cost any more Significant portion of web traffic is now encrypted

YouTube video over HTTPS

3

Netflix started encrypting stream

HTTP = non-secure

6

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 4: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

YouTube video over HTTPS

3

Netflix started encrypting stream

HTTP = non-secure

6

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 5: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Netflix started encrypting stream

HTTP = non-secure

6

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 6: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

HTTP = non-secure

6

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 7: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

ISPs need to understand traffic mix bull to figure out what to control in the presence

of congestion ndash Shaping HTTP flows is too coarse-grained ndash Shaping flows from a range of IP addresses is also too

coarse-grained

bull to know demand of end-users ndash What types of services are consuming network resources rarr Can be used to rethink new architecture or business model peering policy installing cache mechanism WAN optimization CCNICN

bull Obstacle coping with HTTPS 7

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 8: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

HTTP vs HTTPS bull HTTP

ndash HTTP header composes of URL information

bull HTTPS ndash Entire HTTP protocol including header is

encrypted No URL information is available

8

HTTP header

TCP header

IP header

HTTP payload

TCP header

IP header SSLTLS encryption

httpwwwexamplecom

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 9: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Solution 1 Server IP addresses

bull Many of IP addresses can be reverse looked up (PTR record)

bull There are many IP addresses that are not configured to have PTR records

bull A single IP address can be associated with many distinct FQDNS (cloud hosting services etc)

9

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 10: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Solution 2 SSLTLS certificates bull Public key certificate of server is exchanged

during SSLTLS handshake stage The certificate should contain domain name of the server

bull An organization can register a single certificate for many sub-domains ie so called wildcard certificates ndash Eg googlecom

10

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 11: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Solution 3 SNI extension bull SNI (Server Name Identification) of TLS

can be used to obtain FQDN of an HTTPS server

bull Many of clientserver implementations have not adopted SNI yet ndash In our dataset roughly half of HTTPS clients

did not use the SNI extension

11

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 12: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Solution 4 Decrypting HTTPS bull Anti-virus software or firewall products have

mechanisms to intercept HTTPS traffic

bull They use self-signed certificates to work as a transparent HTTP(S) proxy ndash Same as the MTIM (Man-in-the-middle) attack ndash Needs for installation of certificates for each

OSapplication

bull [cf] IETF Explicit Trusted Proxy in HTTP20 (I-D expired)

12

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 13: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Goal bull Estimate server hostnames of HTTPS traffic

ndash Server hostnames can be used as a good hint to

estimate the services provided by the server ndash Eg wwwapplecom itunesapplecom hellip

bull Establish better performance than the existing solution (DN-Hunter)

13

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 14: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Idea bull Leverage DNS name resolutions that

precedes HTTPS transactions ndash Labeling data plane using control plane ndash This is not a simple task as we will describe soon ndash [cf] state-of-the-art = DN-Hunter (IMC 2012)

bull Use statistical inference when measurement

is incomplete ndash DNS resolutions can be missed due to some

reasons

14

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 15: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Illustration of DNS approach

15

Client C1

DNS resolver

HTTPS server S1

612131464 (akamai) NTT-COM

102023

612131464443 rArr 10202331587 = wwwapplecom (APPLE)

wwwapplecom rArr 612131464 from 102023

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 16: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Three practical challenges 1) CNAME tricks

2) Incomplete measurements

3) Dynamicity diversity and ambiguity

16

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 17: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

1) CNAME tricks bull Modern CDN providers heavily make use of

CNAME tricks to optimize content distribution bull We need to keep track of not only clientserver IP

addresseshostnames but also intermediate CNAMEs

17

s1 m1 n1 m2

232132181

e1630cakamaiedgenet (TTL=20)

wwwieeeorgedgekeynet (TTL=11002)

wwwieeeorg (TTL=60)

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 18: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

2) incomplete measurements bull Various DNS caching mechanisms in the

wild ndash Browserapps ndash OS ndash Home routers wDNS resolver ndash DNS resolvers (OrganizationISPOpen)

bull From the viewpoint of ISPs DNS queries

originated from end-users can be missed due to the intermediate caching mechanisms

18

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 19: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

3) Dynamicity diversity and ambiguity

bull A pathologicalpopular example

19

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 20: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

SFMap (Service-Flow Map)

20

DNS queries responses

HTTPS flow DNG

Sec 32 Flow key s c

Hostname n

input

output

Freq Counter Estimator

Sec 33

Learner

Updater Sec 34

Learning with monitored queries Building DNG (Domain Name Graph)

Hostname estimation using DNG Use maximum likelihood estimation when necessary

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 21: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Illustration of DNG

21

s1 n1 n2

c1 s2 m1

s3

n3

s3 c2

m2

n4

n6

s1 n5

s1 n1

n2

s2 m1

s3

n3

m2

n4

n6

n5

c1

c2

Per-client graphs (local DNG)

A global graph (union DNG)

m2

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 22: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Overview of hostname estimation algorithm (1)

bull Get clientserver IP addresses (CS) from an HTTPS flow

bull Search a set of hostnames N corresponding to (CS) on DNG ndash Enumerate edge nodes N that have

paths reachable from C to S on DNG ndash Also consider TTL expiration

bull If |N| = 1 it is the estimated hostname

22

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 23: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

An example

23

s3

c2 m2

m3

n6

n7

s1 n5

(c2 s1) estimation = n5 (c2 s3) candidates = n6 n7

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 24: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Overview of hostname estimation algorithm (2)

bull If there are multiple candidates sort them in descending order according to the likelihood probabilities ndash Uncertain events use frequencies

bull Second third candidates can be informative

ndash Note The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)

24

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 25: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Updating DNG bull StatesStatistics of DNG is updated

online when a DNS query is observed

25

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 26: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Dataset bull LAB

ndash A small LAN used by research group bull PROD

ndash Middle-scale production network

26

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 27: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Scales of DNGs (12 hours long)

27

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 28: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Estimation accuracy (1)

28

UNION グラフ TTL expiration 無し UNION DNG without TTL expiration

Exact match

Public suffix match

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 29: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Estimation accuracy (2)

29

Accuracies of top-3 estimations (UE-NTE)

The top-3 ranked hostnames were similar in many cases eg pagead2 googlesyndicationcom pubadsgdoubleclicknet googleadsgdoubleclick net

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 30: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Discussion bull Sources of inevitable misclassification

ndash DNS implementations that ignore TTL expiration

bull It keeps holding old information ndash mobility

bull DNS could be resolved in different vantage point

ndash Hardcoded IP addresses bull Some gaming apps did have such mechanism

30

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 31: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Discussion (cont) bull Scalability

ndash Did not matter for our datasets ndash Size of DNG depends on the number of

client IP addresses bull Some aging mechanism should be

incorporated for much large-scale DNGs (future work)

bull URL=hostname + path ndash How can we deal with path ndash Need for a standard mechanism to explicitly

expose path like SNI 31

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 32: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Summary bull SFMap framework estimates hostnames

(~services) of HTTPS traffic using past DNS queries

bull Key ideasuse of DNG and statistical inference

bull SFMap achieved better accuracies than the state-of-the-art work (DN-Hunter) ndash Exact match 90-92 accuracies ndash Public suffix match 94-95 accuracies ndash Top-3 hit 97-98 accuracies

32

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 33: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Acknowledgements bull This work was supported by JSPS KAKENHI Grant

Number 25880020

33

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 34: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Existing research DN Hunter bull Bermudez et al ldquoDNS to the Rescue Discerning Content and

Services in a Tangled Webrdquo ACM IMC 2012

34

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 35: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Comparison with DN-Hunter

35

Distributed monitoring

Statistical estimation

DN Hunter SFMap

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring
Page 36: SFMap: Inferring Services over Encrypted Web Flows using ... · Background(2) Encrypting Web 2 D. Naylor et al., The Cost of the “S” in HTTP. Proceedings of ACM CoNext, 2014

Distributed monitoring

36

ER

ER

ER

DNS resolver SFMap

DNS (Control-plane)

CR

GW

sampled NetFlow (data-plane)

  • SFMap Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs
  • Background(1) Era of web
  • Background(2) Encrypting Web
  • YouTube video over HTTPS
  • Netflix started encrypting stream
  • HTTP = non-secure
  • ISPs need to understand traffic mix
  • HTTP vs HTTPS
  • Solution 1 Server IP addresses
  • Solution 2 SSLTLS certificates
  • Solution 3 SNI extension
  • Solution 4 Decrypting HTTPS
  • Goal
  • Idea
  • Illustration of DNS approach
  • Three practical challenges
  • 1) CNAME tricks
  • 2) incomplete measurements
  • 3) Dynamicity diversity and ambiguity
  • SFMap (Service-Flow Map)
  • Illustration of DNG
  • Overview of hostname estimation algorithm (1)
  • An example
  • Overview of hostname estimation algorithm (2)
  • Updating DNG
  • Dataset
  • Scales of DNGs (12 hours long)
  • Estimation accuracy (1)
  • Estimation accuracy (2)
  • Discussion
  • Discussion (cont)
  • Summary
  • Acknowledgements
  • Existing research DN Hunter
  • Comparison with DN-Hunter
  • Distributed monitoring