Publishing Search Query logs

Lecture 17: 590.03 Fall 12 1

CompSci 590.03

Instructor: Ashwin Machanavajjhala

Outline

• Uses of search query logs

• Privacy and search logs

• K-Anonymity

• Differentially private algorithms


Search Query Log

• <anonid, query, querytime, itemrank, clickurl>
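As a minimal sketch, one record of this schema could be modeled as below. The field types, the tab-separated layout, and the names `LogRecord` and `parse_record` are assumptions made for illustration; the slide only lists the five field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    anonid: str                # pseudonymous user identifier
    query: str                 # raw query text
    querytime: str             # timestamp, kept as text in this sketch
    itemrank: Optional[int]    # rank of the clicked result, if any
    clickurl: Optional[str]    # clicked URL, if any


def parse_record(line: str) -> LogRecord:
    """Parse one tab-separated line (assumed layout) into a LogRecord."""
    fields = line.rstrip("\n").split("\t")
    # Pad rows that carry no click information.
    fields += [""] * (5 - len(fields))
    anonid, query, querytime, itemrank, clickurl = fields[:5]
    return LogRecord(
        anonid=anonid,
        query=query,
        querytime=querytime,
        itemrank=int(itemrank) if itemrank else None,
        clickurl=clickurl or None,
    )
```

Note that even with a pseudonymous `anonid`, all queries of one user stay linked, which is what the privacy discussion later in the lecture builds on.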


Uses of search query logs

• Search result caching

• Query Recommendation

• Synonym identification

• Reranking search results

• Search advertising and Keyword popularity estimation


Uses of search query logs [Silvestri FnT ‘10]


Google Flu

“We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time.” http://www.google.org/flutrends/

• Predictions by Google Flu are 1-2 weeks ahead of CDC’s ILI (Influenza-like illness) surveillance reports

[Ginsberg Nature ‘09]


Google Flu

• Identify the single ILI-related search queries that most accurately model the CDC ILI visit percentages in 9 regions

– P: probability of an ILI-related physician visit in a region (based on CDC data)

– Q: ILI-related query fraction

• Pick the 45 highest-scoring queries, and fit a linear model to predict ILI visit rates.

[Ginsberg Nature ‘09]
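The model formula itself is not legible on this slide. In the cited paper [Ginsberg Nature ‘09], the fitted model is logit(P) = β0 + β1 · logit(Q) + ε, with Q the combined query fraction of the selected queries. The following is a minimal sketch of such a fit using ordinary least squares; the weekly numbers are made up and all function names are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fit_flu_model(query_fractions, visit_fractions):
    """Fit logit(P) = b0 + b1 * logit(Q) on weekly observations."""
    return fit_line([logit(q) for q in query_fractions],
                    [logit(p) for p in visit_fractions])


def predict_visit_fraction(b0, b1, query_fraction):
    """Map a new query fraction Q to an estimated visit probability P."""
    return inv_logit(b0 + b1 * logit(query_fraction))


# Made-up weekly values, for illustration only.
Q = [0.001, 0.002, 0.004, 0.003]
P = [0.010, 0.018, 0.035, 0.027]
b0, b1 = fit_flu_model(Q, P)
estimate = predict_visit_fraction(b0, b1, 0.0025)
```

Working in logit space keeps the predicted visit rate between 0 and 1, which a plain linear fit on the raw fractions would not guarantee.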


Google Flu [Ginsberg Nature ‘09]


Flu Trends

[Figure: Google Flu estimate vs. national data; panels: U.S., Australia]

• K-Anonymity



Privacy and Search Logs [NYTimes 2006]


Sensitive Information

• Obtain sensitive information directly from queries (user1)

• Identifying users via demographic attributes (user2)

• Identifying users by following URLs (user3)

• Identification leads to learning sensitive queries (user2)

[Chen et al FnT ‘09]


Challenges

• It is not clear which queries are identifying and which queries are sensitive

• Users' queries are almost always unique

• Adversaries may launch active attacks

– Create many queries from different accounts to test whether some user searched for sensitive queries.

• Privacy-Enhancing Techniques



Identifier Deletion

• Delete personally identifying information like IP addresses, cookies, names, social security numbers …

• Even if we remove age, gender, zip code from search logs, one can estimate these from the remaining log [Jones et al ‘07]


Hashing Queries

• Replace queries with hash values

• If token-based hashing schemes are used, the original words can be estimated through co-occurrence analysis [Kumar et al WWW ‘07]

• Utility is lost …
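A minimal sketch of token-based hashing is given below, to show why it leaks: each word always maps to the same hash value, so token frequencies and co-occurrence counts are preserved exactly, and those are the statistics a co-occurrence attack can match against a public reference corpus. The use of salted SHA-256, the 8-character truncation, and all function names are assumptions made for illustration, not the scheme analyzed in [Kumar et al WWW ‘07].

```python
import hashlib
from collections import Counter
from itertools import combinations


def hash_token(token: str, salt: str = "secret-salt") -> str:
    """Deterministically map one token to a short hex digest."""
    digest = hashlib.sha256((salt + token.lower()).encode("utf-8")).hexdigest()
    return digest[:8]


def hash_query(query: str) -> list:
    """Token-based hashing: every whitespace-separated word is hashed on its own."""
    return [hash_token(tok) for tok in query.split()]


def cooccurrence_counts(hashed_queries):
    """Count how often two hashed tokens appear in the same query."""
    counts = Counter()
    for tokens in hashed_queries:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


log = ["cheap flights", "cheap hotels", "cheap flights paris"]
hashed = [hash_query(q) for q in log]
pairs = cooccurrence_counts(hashed)
```

In this toy log the hash of "cheap" appears in all three hashed queries, so its frequency rank alone already narrows down which word it stands for.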


K-anonymity and deleting infrequent queries

• Unlikely that many people search for a specific individual’s identifiers.

• Algorithm: Suppress all queries which are posed by at most K users.

• … But a combination of frequent queries can still identify an individual …

• Solution: Split a user's log into smaller ones … based on query sessions

– A query session is a set of queries that are related to each other.

[Adar WWW 07]
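The suppression step above can be sketched as follows. This is an illustration under stated assumptions: the log is a list of (user, query) pairs, frequency is counted in distinct users, and the name `suppress_infrequent` is hypothetical. It does not implement the session-splitting refinement.

```python
from collections import defaultdict


def suppress_infrequent(log, k):
    """Drop every query posed by at most k distinct users.

    `log` is an iterable of (user_id, query) pairs.
    """
    log = list(log)
    users_per_query = defaultdict(set)
    for user_id, query in log:
        users_per_query[query].add(user_id)
    # Keep a record only if more than k different users posed its query.
    return [(u, q) for u, q in log if len(users_per_query[q]) > k]


log = [
    ("u1", "weather"), ("u2", "weather"), ("u3", "weather"),
    ("u1", "jane doe 12 elm street"),
    ("u2", "flu symptoms"), ("u3", "flu symptoms"),
]
kept = suppress_infrequent(log, k=2)
```

With k = 2, only the "weather" records survive. The surviving records still carry the user identifier, which is why a combination of frequent queries can remain identifying.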


TrackMeNot

• Users send noise queries in addition to real queries

• TrackMeNot is a browser plugin which posts queries to search engines

Problems:

• The distribution of noise queries is different from the distribution of actual queries … so the noise can be removed

• Imposes load on the search engine

• Query log loses utility …

[Howe & Nissenbaum 08]

• Privacy-Enhancing Techniques



Differential Privacy and Search Logs

• Consider two databases that differ in the log of one user

• In the worst case, all queries are by the same user

– Sensitivity = |query log|

– Can guarantee no utility!

• Pick at most m queries from each user, so that one user's contribution is bounded
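A minimal sketch of the truncation step, assuming the log is a list of (user, query) pairs. The slide only says "at most m queries"; sampling m distinct queries uniformly at random, the fixed seed, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def keep_m_per_user(log, m, seed=0):
    """Keep at most m distinct queries per user, chosen uniformly at random."""
    rng = random.Random(seed)
    queries_by_user = defaultdict(set)
    for user_id, query in log:
        queries_by_user[user_id].add(query)
    kept = {}
    for user_id, queries in queries_by_user.items():
        ordered = sorted(queries)  # sorted so that sampling is reproducible
        kept[user_id] = ordered if len(ordered) <= m else rng.sample(ordered, m)
    return kept


def query_histogram(kept):
    """Number of users contributing each query after truncation."""
    return Counter(q for queries in kept.values() for q in queries)
```

After truncation, removing one user's log changes the histogram by at most m in total, no matter how many queries that user originally posed.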


Differential Privacy and Search Logs

• The domain of search terms is very large. Hence no differentially private algorithm is “useful”.

• Consider the problem of publishing all queries posted by at least τ users each.

Theorem: [Gotz et al TKDE 2012] For a sufficiently large domain size, the accuracy of any differentially private algorithm is worse than that of an algorithm that always returns the empty set!


Probabilistic Differential Privacy

For every pair of inputs D1, D2 that differ in one value, and for every probable output O:

\[
\Pr\left[\, O \;\middle|\; \frac{\Pr[D_1 \to O]}{\Pr[D_2 \to O]} < e^{\varepsilon} \right] > 1 - \delta
\]

Adversary may distinguish between D1 and D2 based on a set of unlikely outputs with probability at most δ


Publishing Frequent queries/clicks

[Korolova et al WWW 2009, Gotz et al TKDE 2012]
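The algorithm on this slide does not appear in the text. The sketch below is reconstructed from the parameters named on the following slides (m, λ, τ, τ’) and should be read as an illustration of a two-threshold scheme under those assumptions, not as the exact procedure of either cited paper. The function names and the input format (a mapping from user to queries) are hypothetical.

```python
import random
from collections import Counter


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def publish_frequent_queries(user_queries, m, lam, tau, tau_prime, seed=None):
    """Two-threshold publication of frequent queries.

    user_queries maps each user to an iterable of that user's queries.
    Returns {query: noisy_count} for the queries that pass both thresholds.
    """
    rng = random.Random(seed)
    counts = Counter()
    # Step 1: each user contributes at most m distinct queries.
    for queries in user_queries.values():
        distinct = sorted(set(queries))
        if len(distinct) > m:
            distinct = rng.sample(distinct, m)
        counts.update(distinct)
    published = {}
    for query, count in counts.items():
        # Step 2: drop queries whose true count is below the first threshold.
        if count < tau:
            continue
        # Step 3: add Laplace noise and publish only if above tau_prime.
        noisy = count + laplace(lam, rng)
        if noisy > tau_prime:
            published[query] = noisy
    return published
```

The first threshold acts on true counts and removes the long tail of rare queries; the second acts on noisy counts, so whether a borderline query is published depends on the noise draw.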


Privacy

• The algorithm presented in the previous slide guarantees (ε,δ)-probabilistic differential privacy if

– Where U is the number of users, m is the maximum number of queries per user, λ is the Laplace noise parameter, and τ, τ’ are the two thresholds used by the algorithm


Utility

• Let ξ = (τ’ – τ)/3, and let τ* = τ + ξ

Any query that appears with frequency < τ* – ξ …

• Has frequency less than τ

• Is published in the output with probability 0.

Any query that appears with frequency > τ* + ξ …

• Is published if τ* + ξ + Lap(λ) > τ’

• That is, noise > ξ

• That is, query is published with probability 1 – 0.5·e^(–ξ/λ).


Utility [Gotz et al TKDE 2012]

Distributions are significantly different


Web Caching scenario

• Speed up web search by storing the results for the most frequent queries.

• Each keyword is given a score based on its frequency in the (anonymous) log.

• The top few keywords are maintained in memory …

[Gotz et al TKDE 2012]
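The caching scenario can be sketched as below, assuming the anonymized log is available as a list of query strings (one entry per occurrence) and that a backend lookup function exists. The names `build_cache`, `search`, and `fetch_results` are hypothetical, and scoring by raw frequency is the simplest reading of the slide.

```python
from collections import Counter


def build_cache(published_queries, fetch_results, capacity):
    """Precompute results for the `capacity` most frequent published queries."""
    scores = Counter(published_queries)
    return {q: fetch_results(q) for q, _ in scores.most_common(capacity)}


def search(query, cache, fetch_results):
    """Serve from memory when possible; otherwise fall back to the backend."""
    if query in cache:
        return cache[query], True
    return fetch_results(query), False
```

Because the cache only needs the most frequent queries, this application is comparatively tolerant of anonymization schemes that discard the low-frequency tail.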


Summary

• Publishing search logs can lead to very useful applications

– Web

– Social Science

– …

• Search logs contain very sensitive information, and individuals are easily identifiable.

• Simple techniques do not provide sufficient protection

• Differentially private techniques throw away a significant amount of data

– Only m queries per person

– All tail queries (with low frequency) are thrown away


References

F. Silvestri, “Mining Query Logs: Turning Search Usage Data into Knowledge”, Foundations and Trends 4 (1-2), 2010.

J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, L. Brilliant, “Detecting influenza epidemics using search engine query data”, Nature, vol. 457, Feb 2009.

B.-C. Chen, D. Kifer, K. LeFevre, A. Machanavajjhala, “Privacy-Preserving Data Publishing”, Foundations and Trends® in Databases, Vol. 2, No. 1-2, pp. 1-167, 2009.

R. Jones, R. Kumar, B. Pang, A. Tomkins, “I know what you did last summer: query logs and user privacy”, CIKM 2007.

R. Kumar, J. Novak, B. Pang, A. Tomkins, “On anonymizing query logs via token-based hashing”, WWW 2007.

E. Adar, “User 4xxxx9: Anonymizing query logs”, WWW 2007.

D. Howe, H. Nissenbaum, “TrackMeNot: Resisting surveillance in web search”, 2008.

A. Korolova, K. Kenthapadi, N. Mishra, A. Ntoulas, “Releasing Search Queries and Clicks Privately”, WWW 2009.

M. Gotz, A. Machanavajjhala, G. Wang, X. Xiao, J. Gehrke, “Publishing Search Logs”, TKDE 2012.
