
User profiling in the time of HTTPS

Roberto Gonzalez
NEC Labs Europe

[email protected]

Claudio Soriente
Telefonica Research

[email protected]

Nikolaos Laoutaris
Telefonica Research

[email protected]

ABSTRACT

Tracking users within and across websites is the basis for profiling their interests, demographic types, and other information that can be monetised through targeted advertising and big data analytics. The advent of HTTPS was supposed to make profiling harder for anyone beyond the communicating end-points. In this paper we examine to what extent the above is true. We first show that by knowing the domain that a user visits, either through the Server Name Indication of the TLS protocol or through DNS, an eavesdropper can already derive basic profiling information, especially for domains whose content is homogeneous. For domains carrying a variety of categories that depend on the particular page that a user visits, e.g., news portals, e-commerce sites, etc., the basic profiling technique fails. Still, accurate profiling remains possible through transport layer fingerprinting that uses network traffic signatures to infer the exact page that a user is browsing, even under HTTPS. We demonstrate that transport fingerprinting remains robust and scalable despite hurdles such as caching and dynamic content for different device types. Overall, our results indicate that although HTTPS makes profiling more difficult, it does not eradicate it by any means.

1. INTRODUCTION

Online user profiling is a profitable business extensively carried out by third parties such as search engines, ad networks and network providers. It leverages browsing activities to infer user interests and intentions. Since HTTP traffic has no privacy provisions, any third party can pry on the connections to a webserver and profile users. HTTPS enhances online user privacy by encrypting the communication between a browser and a webserver. Major Internet stakeholders are pushing for an HTTPS everywhere web with the promise of increased security and privacy and, therefore, of mitigating the problem of user profiling by third parties.

In this paper we assess the extent to which HTTPS prevents third parties from profiling users based on the websites they visit. We show that the widely used Server Name Indication (SN) extension of the TLS protocol leaks user interests to third parties that eavesdrop on the (encrypted) connection between a client and an HTTPS webserver. The SN extension improves address-space utilization, as it allows several HTTPS webservers to be consolidated at a given IP address. However, SN also hinders user privacy, as it leaks the domain requested by a user, despite the HTTPS pledge of a secure and private connection.

The privacy leakage due to SN is especially severe in websites with homogeneous content across their pages. For example, a connection request to www.foxsports.com tells a lot about the user's interests, regardless of the actual page the user is browsing within that website. For a website with more variety across its pages, the domain requested by a user may not tell enough about the interests of that user. For example, a connection to www.amazon.com may not tell much about a user. However, a connection to www.amazon.com/books would reveal an interest in books, and a connection to www.amazon.com/baby may indicate an intent to buy baby items. We show that by using fingerprinting techniques, a network eavesdropper can accurately tell the page a user is browsing within a domain and, therefore, build a refined user profile.

User profiling despite HTTPS is achievable, given enough bandwidth to fingerprint the websites under observation. If bandwidth is an issue, we also define an optimization problem that allows an eavesdropper to periodically pick the websites to fingerprint in order to maximize the number of users that are correctly profiled over time.

Overall, our findings show that HTTPS, while being a formidable tool to strengthen the security of web applications, cannot protect users against online profiling by third parties.

2. BACKGROUND AND MODEL

User profiling.

Profiling systems often use a closed-source mapping between URLs and interest categories. We follow the approach of previous work [1] and instantiate the mapping using the Display Planner of Google AdWords [2] – an online tool that, given a URL, returns the set of categories assigned by AdWords to that URL. Categories are arranged in a hierarchy and each URL has, on average, 10 assigned categories. The Display Planner also provides the inverse mapping, i.e., given a category it provides a list of websites that belong to that category.

HTTPS and Server Name Indication.

HTTPS enhances HTTP with the Transport Layer Security (TLS) protocol. TLS provides a secure pipe to a server that is usually authenticated via an X.509 certificate. The secure pipe is established via a TLS handshake – a procedure that allows the client and the server to establish cryptographic keys to encrypt and authenticate data exchanged through the pipe.

Given the ever-increasing awareness of the privacy issues of HTTP, major web stakeholders are mandating secure (i.e., HTTPS) connections to serve their websites [3, 4]. Furthermore, the Tor project and the EFF promote the HTTPS Everywhere extension [5, 6], which automatically redirects browsers to the HTTPS version of a website when available. One goal of this collective effort towards an HTTPS web is to increase online privacy with respect to network eavesdroppers. HTTPS ensures that a user is connected to the legitimate webserver and that the exchanged information cannot be eavesdropped by third parties (note that the security guarantees of HTTPS do not cover phishing attacks or flaws in the public key certification system).

Server Name Indication (SN) is an extension of the TLS protocol by which a client specifies the hostname it is attempting to connect to in the client hello message (the first message of a TLS handshake). The extension is widely used by modern browsers and allows a server to serve multiple HTTPS websites, each with its own X.509 certificate, from the same IP address. Since the client hello is exchanged before any cryptographic keys are established, the SN is sent in cleartext and can be eavesdropped by any party tapping on the network between the client and the server.
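To make the leakage concrete, the following minimal Python sketch shows how an on-path observer could pull the SN hostname out of a raw TLS record carrying a ClientHello. This is our own illustration, not tooling from the paper: the parsing offsets follow the standard TLS record and handshake layout, and the capture step that produces the raw bytes (e.g., via libpcap) is assumed and out of scope.

    # Minimal sketch: extract the SNI hostname from a TLS ClientHello.
    import struct
    from typing import Optional

    def extract_sni(record: bytes) -> Optional[str]:
        if len(record) < 5 or record[0] != 0x16:      # 0x16 = handshake record
            return None
        hs = record[5:]                               # skip 5-byte record header
        if len(hs) < 4 or hs[0] != 0x01:              # 0x01 = ClientHello
            return None
        pos = 4                                       # skip handshake header
        pos += 2 + 32                                 # client version + random
        pos += 1 + hs[pos]                            # session id
        cs_len, = struct.unpack_from("!H", hs, pos)
        pos += 2 + cs_len                             # cipher suites
        pos += 1 + hs[pos]                            # compression methods
        if pos + 2 > len(hs):
            return None                               # no extensions block
        ext_total, = struct.unpack_from("!H", hs, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", hs, pos)
            pos += 4
            if ext_type == 0x0000:                    # server_name extension
                # list length (2) | name type (1) | name length (2) | name
                name_len, = struct.unpack_from("!H", hs, pos + 3)
                return hs[pos + 5:pos + 5 + name_len].decode("ascii", "replace")
            pos += ext_len
        return None

Note that the hostname is recovered without touching any encrypted payload: everything the function reads precedes key establishment.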

System Model.

We consider a network eavesdropper that tries to profile users by tapping on their network connections. We assume an HTTPS everywhere web where users connect to any website via HTTPS. The network eavesdropper, therefore, does not have access to the cleartext traffic exchanged between the user browser and the webservers but only sees encrypted flows. However, we assume the eavesdropper can infer the hostname requested by the user by looking at the SN in the client hello message. In case SN is not used, client queries to DNS (recall that DNS has no provisions for confidentiality) or simply a whois lookup on the destination IP address may reveal the hostname requested by the user.

We simplify the structure of a website and user browsing behaviour as follows. Each website has a main page and a set of 1st-level pages (i.e., the pages linked on the main page). We do not consider pages of the website beyond the ones linked on the main page, but our results can be easily generalized to account for more complex website structures. Similar to previous work [7, 8, 9, 10, 11], we assume a user visits one page at a time for each domain (discerning traffic when multiple pages of a domain are fetched simultaneously using HTTPS remains an open problem). This could be either the main page, or any of the 1st-level pages. The eavesdropper tries to infer the page visited by the user and assigns to her profile the corresponding set of categories according to Google AdWords.

3. USER PROFILING BY SERVER NAME

Looking at the SN in the client hello message, a basic eavesdropper learns the website a user is browsing and assigns the categories of the main page to the user profile. If the user were actually browsing a page different from the main one, the profile built by the basic eavesdropper may not be accurate.

A first step towards understanding the accuracy of user profiling in an HTTPS everywhere web consists in assessing the difference between the categories of a website's main page (e.g., the categories of www.nbcnews.com/) and the ones of any of its 1st-level pages (e.g., the categories of www.nbcnews.com/politics/).

In this experiment we have collected the list of top websites returned by AdWords for each of its 24 first-level categories. Within each list, we have selected the 100 most popular websites based on their rank in Alexa [12]. For each of the resulting 2.4K websites we have fetched the URLs of all the links available on the main page that remain within the same host; we did not consider external links like the ones to CDNs. Each of the collected URLs (totalling more than 110K URLs) was submitted to the AdWords Display Planner to obtain its set of categories.
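As an illustration of this crawling step, the sketch below (our own, not the authors' actual crawler) fetches a main page and keeps only the 1st-level links that stay on the same host, discarding external links such as CDNs. It assumes the third-party requests and beautifulsoup4 packages are available.

    # Illustrative sketch: collect same-host 1st-level links from a main page.
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def first_level_pages(main_url: str) -> set:
        host = urlparse(main_url).netloc
        html = requests.get(main_url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for a in soup.find_all("a", href=True):
            url = urljoin(main_url, a["href"])     # resolve relative links
            if urlparse(url).netloc == host:       # drop external links (CDNs etc.)
                links.add(url.split("#")[0])       # ignore in-page anchors
        return links

    # e.g., first_level_pages("https://www.nbcnews.com/")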

For each of the 24 top-level categories of AdWords, Figure 1 shows the distribution of the Jaccard index between the categories assigned to the main page of a website and the categories assigned to its 1st-level pages. A Jaccard index close to 1 means that simply assigning the categories of the main page to a user creates a quite accurate profile, regardless of the actual page the user is browsing within that website.

Figure 1: Distribution of the Jaccard index among the categories of the main page and the 1st-level pages, for each of the 24 AdWords top-level categories.

A Jaccard index close to 0 means that the same profiling technique may lead to a less accurate user profile.

Figure 1 shows a great variance depending on the main category of the website. Users visiting Sports, Real Estate or Games websites can be profiled very accurately just by knowing the website they are connected to. However, when a user visits any page within a website related to, e.g., Shopping, Computers & Electronics or News, the user profile built by assigning her the categories of the main page is likely to be inaccurate.
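For concreteness, the snippet below computes the Jaccard index used throughout this section; the category sets are illustrative placeholders, not data from the AdWords Display Planner.

    # Jaccard index between the main page's categories and a 1st-level page's.
    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if (a or b) else 1.0

    main_page = {"News", "Politics", "World News"}       # hypothetical categories
    level1    = {"News", "Politics", "Elections"}
    print(jaccard(main_page, level1))   # 0.5 -> basic profile only partly accurate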

4. TRAFFIC FINGERPRINTING

In this section we show how to improve profiling accuracy by guessing the exact page a user is browsing using traffic fingerprinting. Traffic fingerprinting [7, 8, 9, 10] is an active research area on techniques to infer information (such as the visited page on an encrypted connection) by solely observing traffic patterns at the network/transport level.

Fingerprinting involves a training phase during which the adversary builds a fingerprint of each of the monitored pages. This is accomplished by fetching the monitored pages multiple times and recording features of the generated traffic such as packet sizes or inter-arrival times. Later, the adversary eavesdrops on the client's connection, extracts the same features from the client's traffic, and tries to match the client trace to one of the fingerprints computed during the training phase. Differences between the training data and the client (or test) data, due to, e.g., different routes or congestion, are mitigated using statistical methods.

We use and adapt to our scenario the fingerprinting technique of [11] – the most accurate web fingerprinting framework to date – that uses as features the size and the direction of each packet of a TCP connection. The classifier is, therefore, robust against differences in bandwidth or congestion along the route. The authors of [11] show that page fingerprinting is hard in an open-world scenario in which the client can browse any page outside of the set monitored by the eavesdropper. We show that webpage fingerprinting can be reasonably accurate in a closed-world scenario in which the eavesdropper monitors all the pages that the client can possibly visit.

Figure 2: Accuracy of the classifier when fetching pages from a PC (with and without cache) and from a Mobile Device.

This assumption is realistic in our settings because the eavesdropper knows the website requested by the user (by looking at the SN in the client hello message) and must infer which page she is browsing within this particular website.

The features we extract from the traffic generated by downloading a page include: the number of incoming packets and the number of outgoing ones, the total size of incoming packets and the total size of outgoing ones, and a trace defined over the size and the order of the observed packets (given the space constraints, we refer the reader to [11] for the details of the feature selection process). We use an SVM classifier with an RBF kernel parametrized with γ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000} and c ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000}. For each monitored website, we capture with tcpdump the traffic generated by fetching each of the 1st-level pages 50 times and measure the accuracy of the classifier using 10-fold cross validation.
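The following scikit-learn sketch mirrors this setup under simplifying assumptions: it keeps only the four aggregate features above (dropping the packet-order trace of [11]), searches the stated γ and c grids with 10-fold cross validation, and assumes a hypothetical input format of (signed packet sizes, page label) pairs, with positive sizes outgoing and negative sizes incoming.

    # Sketch of the per-website page classifier: RBF-kernel SVM + grid search.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    def features(packet_sizes):
        sizes = np.asarray(packet_sizes)
        inc, out = sizes[sizes < 0], sizes[sizes > 0]
        # counts and total volumes, per direction
        return [len(inc), len(out), -inc.sum(), out.sum()]

    def train_page_classifier(traces):
        X = np.array([features(sizes) for sizes, _ in traces])
        y = np.array([label for _, label in traces])
        grid = {"gamma": [10.0 ** k for k in range(-3, 5)],   # 0.001 .. 10000
                "C":     [10.0 ** k for k in range(-3, 5)]}
        search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)
        search.fit(X, y)
        acc = cross_val_score(search.best_estimator_, X, y, cv=10).mean()
        return search.best_estimator_, acc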

Classifier Accuracy.

For this experiment we pick 9 websites that have a low Jaccard index between the main page and the 1st-level pages (see Figure 1). For each website we train the classifier and test its accuracy in three different scenarios: a PC running Mozilla Firefox with and without cache, and a mobile device running Google Chrome with cache enabled. In the latter scenario we use the Android emulator to fetch the pages from an emulated Nexus 5, using the built-in feature of the emulator to simulate the conditions of a 3G network.

Figure 2 shows the accuracy of the classifier for the 9 websites in each one of the aforementioned scenarios. We found the lowest accuracy in the PC with cache scenario when predicting pages of aarp.com (0.79), while we experienced the highest accuracy for amazon.com (0.97). Caching inevitably hinders the accuracy of the classifier, by 10.3% on average, but the average accuracy never drops below 0.48. The accuracy decreases because when parts of a page are in the local cache, the traffic trace available to the classifier becomes shorter and, therefore, more likely to be confused with that of another page (in the extreme case of a page whose elements are all in the cache, the resulting trace becomes totally indistinguishable from that of any other fully cached page). The mobile phone scenario suffers from a similar issue, not only due to caching, but also because mobile versions of a site are typically simpler than their desktop counterparts, and thus they end up producing more similar traffic traces.

From Page Prediction to User Profiling.

In this experiment we take a closer look at the effect of the classifier accuracy on the quality of the user profiles built by the eavesdropper.

Figure 3a shows the confusion matrix for edition.cnn.com, where pages are sorted lexicographically based on their URL. For the same website and the same sorting of its pages, the matrix in Figure 3b shows the Jaccard index between any pair of 1st-level pages. Due to the sorting, pages under the same branch of the website, say edition.cnn.com/style, appear sequentially in both matrices. Figure 3a shows that when the classifier makes a mistake, the output page tends to be close to the correct one. For example, edition.cnn.com/style/arts is often mis-classified as edition.cnn.com/style/fashion and vice versa. This is because the features we use to train the classifier look at the structure of a page (e.g., the number and position of textboxes) rather than its content (e.g., the actual text). Therefore, when pages within the same branch of a website share a similar structure, we experience classification mistakes similar to the ones of Figure 3a.

When mis-classification happens, the amount of damage to user profiling accuracy depends on whether the categories of the true page and the categories of the page output by the classifier overlap or not. For example, because of their similar structure, edition.cnn.com/asia/ is likely to be predicted as edition.cnn.com/africa/ by the classifier (see box 1 in Figure 3a); however, given that the set of their categories is very similar (see box 1 in Figure 3b), the mistake of the classifier has very little impact on the quality of user profiling. Of course this is not always the case. For example, the pages under edition.cnn.com/style/ (see box 2 in Figure 3a) are likely to be confused with one another by the classifier. This, however, leads to high profiling error because different pages under the "style" branch of the website have little overlap in terms of categories (see box 2 in Figure 3b).

Figure 4 depicts the performance of the basic and the advanced profiling techniques when monitoring the 9 websites of Section 3. Dashed bars show the precision and recall of the basic profiling technique described in Section 3. Solid bars show the precision and recall of the advanced profiling mechanism that leverages the web fingerprinting technique described above. User profiling leveraging web fingerprinting clearly outperforms the basic profiling technique.

Classifier Freshness.

The difference between the time when the classifier is trained and the time when pages are predicted may affect the prediction accuracy. This is especially true for very dynamic websites (e.g., news or online community websites). In this experiment we discretize time in epochs and we assume that website content only changes from one epoch to the next one. If the train and the test data are collected in the same epoch, we say that the classifier is fresh; otherwise we say that the classifier is stale. We define epochs as days. We train the classifier over a snapshot of the website on a given day, and we try to predict pages fetched throughout the following 6 days.

We expect a noticeable difference in accuracy between a stale classifier and a fresh one for dynamic pages whose content changes every day (e.g., news websites). In the case of websites with static content, the difference between a stale classifier and a fresh one is expected to be less pronounced. To verify this, we add 4 websites with mostly static content (2 corporate and 2 academic ones) to the 9 websites of the previous experiments.

Figure 5 shows the effect of staleness on the accuracy of the classifier for both a static and a dynamic website. The dashed lines represent the percentage of 1st-level pages that remain linked on the main page across days, while the solid ones represent the accuracy of the classifier. We observe that the accuracy of the classifier for the dynamic website decreases rapidly, while the accuracy for the static one decreases slowly during the first two days and then stabilizes at around 80% accuracy.


Figure 3: Confusion matrix of the classifier (a) and the Jaccard index between the categories assigned to two different pages (b) for edition.cnn.com. Box number 1 highlights URLs of the type edition.cnn.com/[region]. Box number 2 shows URLs under the branch edition.cnn.com/style/.

For both lines, the shaded regions denote the minimum and the maximum of the statistics.

5. OPTIMIZING BANDWIDTH USE

In a real-world deployment, the eavesdropper may not have the bandwidth required to refresh the classifier of each monitored website at every epoch. In the following we formulate an optimization problem for maximizing the profiling quality given a bandwidth constraint.

We consider an eavesdropper that monitors a corpus of $n$ websites $w_1, \ldots, w_n$. Website $w_i$ has a main page $p^i_0$ and $s_i$ 1st-level pages $p^i_1, \ldots, p^i_{s_i}$. We also use $c(p^i_j)$ to denote the set of categories of page $p^i_j$. When browsing website $w_i$, the user may visit any page $p^i_j$, with $j = 0, \ldots, s_i$. Since the connection is encrypted, we do not make any assumption on which are the most popular pages within $w_i$.

Figure 4: Precision and recall of the baseline eavesdropper and the eavesdropper leveraging website fingerprinting.

Figure 5: Accuracy of the classifier days after training for a dynamic website (nbcnews.com) and a static one (bu.edu).

If the user visits page $p^i_j$, the correct categories that should be assigned to that user when browsing $w_i$ are, therefore, $c(p^i_j)$. We consider any category in $c(p^i_j)$ that the profiler assigns to that user as a true positive. Similarly, any category not in $c(p^i_j)$ that the profiler assigns to that user is a false positive.

The basic eavesdropper of Section 3 learns $w_i$ from the client hello message issued by the user browser and assigns the categories of the main page $c(p^i_0)$ to that user. However, because of HTTPS, the baseline profiling system cannot tell which page $p^i_j$ was visited.

If we denote by $T^i$ and $F^i$ the expected numbers of true positives and false positives, respectively, we have:

$$T^i = \frac{1}{s_i + 1} \sum_{j=0}^{s_i} \left| c(p^i_0) \cap c(p^i_j) \right|$$

$$F^i = \frac{1}{s_i} \sum_{j=1}^{s_i} \left| c(p^i_j) \setminus c(p^i_0) \right|$$

The advanced eavesdropper of Section 4 tries to infer the page $p^i_j$ the user has fetched by looking at the encrypted traffic trace. This is done by means of a classifier trained on a snapshot of $w_i$. As shown in the previous section, the freshness of the snapshot used to train the classifier impacts its accuracy.


We denote the expected number of true positives and false positives with a classifier that is $t_i$ epochs stale by $T^i_{t_i}$ and $F^i_{t_i}$, respectively. Thus we have:

$$T^i_{t_i} = \sum_{j=0}^{s_i} \pi(p^i_j, p^i_j)\,\left|c(p^i_j)\right| + \sum_{j=0}^{s_i} \sum_{\substack{l=0 \\ l \neq j}}^{s_i} \pi(p^i_j, p^i_l)\,\left|c(p^i_j) \cap c(p^i_l)\right|$$

$$F^i_{t_i} = \sum_{j=0}^{s_i} \sum_{\substack{l=0 \\ l \neq j}}^{s_i} \pi(p^i_j, p^i_l)\,\left|c(p^i_l) \setminus c(p^i_j)\right|$$

where $\pi(p^i_j, p^i_l)$ denotes the probability of predicting page $p^i_j$ as $p^i_l$ (which depends on the freshness of the classifier).
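A direct transcription of these two expectations into code could look as follows, assuming a hypothetical confusion matrix pi[j][l] = π(p^i_j, p^i_l) and per-page category sets cats[j] = c(p^i_j); all names are illustrative.

    # Expected true/false positives for one website, given the confusion
    # matrix `pi` of its (possibly stale) classifier and category sets `cats`.
    def expected_tp_fp(pi, cats):
        n = len(cats)
        # correct predictions contribute all of the page's categories
        tp = sum(pi[j][j] * len(cats[j]) for j in range(n))
        # mis-predictions contribute only the overlapping categories
        tp += sum(pi[j][l] * len(cats[j] & cats[l])
                  for j in range(n) for l in range(n) if l != j)
        # categories of the predicted page absent from the true page
        fp = sum(pi[j][l] * len(cats[l] - cats[j])
                 for j in range(n) for l in range(n) if l != j)
        return tp, fp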

Given the expected number of true and false positives, we set $B$ as the bandwidth budget made available to the eavesdropper at every epoch, and $b_i$ as the bandwidth required to refresh the classifier for website $w_i$. We also denote by $u_i$ the popularity of website $w_i$ (i.e., the number of users that visit $w_i$ in an epoch). Upon every epoch, the eavesdropper decides to spend the budget $B$ by training classifiers on fresh snapshots of a subset $X$ of the monitored websites. If website $w_i$ is included in $X$, the available budget is reduced by $b_i$ and the expected number of correctly assigned categories is $u_i \cdot T^i_0$, while the expected number of mis-assigned categories is $u_i \cdot F^i_0$. If website $w_i$ is not included in $X$, the budget remains untouched and the expected numbers of correctly assigned and mis-assigned categories are $u_i \cdot T^i_{t_i}$ and $u_i \cdot F^i_{t_i}$, respectively, assuming the most recent classifier for $w_i$ is $t_i$ epochs stale.

The selection of $X$, therefore, tries to maximize the number of true positives and to minimize the number of false positives, while respecting the available budget:

$$\max_{X \subseteq \{1, \ldots, n\}} \; \sum_{i \in X} u_i \left( T^i_0 - F^i_0 \right) + \sum_{i \notin X} u_i \left( T^i_{t_i} - F^i_{t_i} \right) \qquad \text{s.t.} \quad \sum_{i \in X} b_i \leq B$$

If a classifier was never trained on $w_i$, we fall back to the profiling technique of the naïve eavesdropper of Section 3, so that the expected numbers of true positives and false positives are $u_i \cdot T^i$ and $u_i \cdot F^i$, respectively.

The above problem resembles the well-known 0/1 knapsack problem, with the only difference that items that are not selected still add a non-zero value to the total gain.
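Because every unselected site still contributes its stale value, the objective can be rewritten as a constant baseline plus the extra gain $g_i = u_i[(T^i_0 - F^i_0) - (T^i_{t_i} - F^i_{t_i})]$ of each refreshed site, which reduces the selection to a standard knapsack. The dynamic-programming sketch below illustrates this reduction under the assumption of integer bandwidth costs; it is our illustration, not the authors' solver.

    # 0/1 knapsack over the refresh gains: pick X maximizing sum(gains[i])
    # for i in X subject to sum(costs[i]) <= budget.
    def pick_sites(gains, costs, budget):
        n = len(gains)
        best = [[0.0] * (budget + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for b in range(budget + 1):
                best[i][b] = best[i - 1][b]              # skip site i-1
                if costs[i - 1] <= b:                    # or refresh it
                    cand = best[i - 1][b - costs[i - 1]] + gains[i - 1]
                    best[i][b] = max(best[i][b], cand)
        # backtrack to recover the chosen set X
        X, b = set(), budget
        for i in range(n, 0, -1):
            if best[i][b] != best[i - 1][b]:
                X.add(i - 1)
                b -= costs[i - 1]
        return X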

A toy example.

To illustrate the workings and the value of the above optimization we have conducted a simulation based on the 15 websites from the previous sections. We empirically assessed the training bandwidth requirements and the probabilities of the confusion matrices, while we used Alexa to obtain the popularity of each website.

Figure 6: Output of the optimization problem across 15 days for 2 different bandwidth budgets (500 MB and 2 GB). A black box represents a website whose classifier must be refreshed on that day.

In Figure 6 we use a black box to mark a domain that is selected for re-training on a particular day. We show which domains get to be classified every day under two different budgets – 500 MB and 2 GB, representing 10% and 40%, respectively, of the budget needed to re-classify all sites every day.

We observe that bandwidth availability can strongly affect the daily classification pattern. In the case of the 500 MB budget, the same set of websites gets picked for classification every day. In the case of 2 GB, however, different websites get to compete for the available budget and thus end up being picked or skipped on different days. The actual resulting pattern depends on the interplay between website popularity, size, and dynamicity of content.

The small number of websites in the above example does not leave a lot of margin for profiling performance differences between optimizing only once vs. optimizing every day. We have, however, simulated a larger example that includes 200 pages with a mix of popularities, content dynamicity, and sizes, and have observed that in more complex settings the difference between optimizing only once vs. every day is substantial.

6. CONCLUSIONS

To the best of our knowledge, our study is the first one to demonstrate that network eavesdroppers can profile user interests despite HTTPS. We have shown that even off-the-shelf traffic classification algorithms can guess the page that a user is viewing. Caching and dynamic content tailored to device capabilities make the whole effort harder, but the obtained accuracy remains high. We believe that more specialized classification algorithms, coupled with careful optimisation of classification bandwidth, can yield accurate and scalable user profiling even in more complex settings than the ones we have considered. We plan to prove our claim by developing a fully functioning prototype.


7. REFERENCES

[1] J. M. Carrascosa, J. Mikians, R. Cuevas, V. Erramilli, and N. Laoutaris, "I always feel like somebody's watching me. Measuring online behavioural advertising," in Proc. of ACM CoNEXT'15.

[2] "Display Planner basics." https://support.google.com/adwords/answer/3056115?hl=en. [Online; accessed 12-May-2016].

[3] "SSL compliance." https://support.google.com/richmedia/answer/6015286?hl=en. [Online; accessed 12-May-2016].

[4] M. Belshe, R. Peon, and M. Thomson, "Hypertext transfer protocol version 2 (HTTP/2)," RFC 7540, RFC Editor, May 2015. http://www.rfc-editor.org/rfc/rfc7540.txt.

[5] "HTTPS Everywhere." https://addons.mozilla.org/en-US/firefox/addon/https-everywhere/. [Online; accessed 12-May-2016].

[6] R. Dingledine, N. Mathewson, and P. Syverson, "Tor: The second-generation onion router," tech. rep., DTIC Document, 2004.

[7] T.-F. Yen, Y. Xie, F. Yu, R. P. Yu, and M. Abadi, "Host fingerprinting and tracking on the web: Privacy and security implications," in Proc. of NDSS'12.

[8] A. Hintz, "Fingerprinting websites using traffic analysis," in Proc. of PETS'02.

[9] D. Herrmann, R. Wendolsky, and H. Federrath, "Website fingerprinting: attacking popular privacy enhancing technologies with the multinomial naïve-bayes classifier," in Proc. of CCSW'09.

[10] A. Panchenko, L. Niessen, A. Zinnen, and T. Engel, "Website fingerprinting in onion routing based anonymization networks," in Proc. of ACM WPES'11.

[11] A. Panchenko, F. Lanze, A. Zinnen, M. Henze, J. Pennekamp, K. Wehrle, and T. Engel, "Website fingerprinting at internet scale," in Proc. of NDSS'16.

[12] "Alexa Top Sites." http://www.alexa.com/topsites. [Online; accessed 12-May-2016].
