GDG DevFest Central Italy 2013

Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

The AdWords Problem

Soccer Shoes


Google Advertisement in Numbers

Over a billion queries a day. A large number of advertisers. Source: www.google.com/competition/howgooglesearchworks.html

Challenges

Several scientific and technological challenges: How to find the best ads in real time? How to price each ad? How to suggest new queries to advertisers? The solutions to these problems involve some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).

Google Advertisement in Numbers

2012 revenues: 46 billion USD. Advertising accounts for 95% of them: 43 billion USD. Source: http://investor.google.com/financial/tables.html

Goals of the Project

Mine AdWords data to automatically identify, for each advertiser, its main competitors, and to suggest relevant queries to each advertiser. Goals: useful business information, improved advertisement, more relevant performance benchmarks.

Information Deluge

Large advertisers (e.g. Amazon, Ask.com, etc.) compete in several market segments with very different advertisers.

Query: Nike store New York. Information: Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks.
Query: Soccer shoes. Information: Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks.
Query: Soccer ball. Information: Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks.
... millions of other queries ...

Representing the data

How to represent the salient features of the data? Relationships between advertisers and queries. Statistics: clicks, costs, etc. Take the categories into account. Efficient algorithms.

Graphs: the lingua franca of Big Data

Mathematical objects studied well before the history of computers: the Königsberg bridges problem (Euler, 1735).

Graphs are everywhere: social networks, technological networks, natural networks.

Formal definition: a set of nodes (A, B, C, D in the figure) and a set of edges between pairs of nodes. The edges might have a weight (1, 4, 2, 3 in the figure).
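The formal definition can be written down directly as a data structure. The following is a minimal sketch: the node names and the weights come from the slide, but which weight sits on which edge is not recoverable from the figure, so the assignment below is an assumption made for illustration.

```python
# Nodes A-D and weights 1-4 appear on the slide; the pairing of
# weights with edges is an assumption made for illustration.
nodes = {"A", "B", "C", "D"}
weighted_edges = {
    ("A", "B"): 1,
    ("A", "C"): 4,
    ("B", "D"): 2,
    ("C", "D"): 3,
}


def build_adjacency(nodes, weighted_edges):
    """Return {node: {neighbor: weight}} for an undirected weighted graph."""
    adjacency = {node: {} for node in nodes}
    for (u, v), weight in weighted_edges.items():
        adjacency[u][v] = weight
        adjacency[v][u] = weight
    return adjacency


adjacency = build_adjacency(nodes, weighted_edges)
```

The adjacency form makes the neighbors of a node a single dictionary lookup, which is what random-walk algorithms on the graph need.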

AdWords data as a (Bipartite) Graph

A lot of advertisers, billions of queries, hundreds of labels.
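A small sketch of what such a bipartite graph could look like in code. The advertiser names and click counts are hypothetical sample data, and attaching one label to each advertiser-query edge is an assumption; the slides only say that labels exist.

```python
from collections import defaultdict

# Hypothetical sample data: (advertiser, query, label, clicks).
edges = [
    ("advertiser_1", "nike store new york", "Retailer", 10),
    ("advertiser_1", "soccer shoes", "Apparel", 4),
    ("advertiser_2", "soccer shoes", "Apparel", 7),
    ("advertiser_2", "soccer ball", "Equipment", 5),
]


def build_bipartite(edges, allowed_labels=None):
    """Return {advertiser: {query: clicks}}, keeping only edges with an allowed label."""
    graph = defaultdict(dict)
    for advertiser, query, label, clicks in edges:
        if allowed_labels is None or label in allowed_labels:
            graph[advertiser][query] = graph[advertiser].get(query, 0) + clicks
    return dict(graph)


full_graph = build_bipartite(edges)
apparel_graph = build_bipartite(edges, allowed_labels={"Apparel"})
```

Restricting the graph to a subset of labels, as `apparel_graph` does, is the kind of operation the later slides refer to when they talk about subsets of categories.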

Semi-Formal Problem Definition

Input: the graph of advertisers and queries, a node A, and a set of labels. Goal: find the nodes most similar to A.

How to Define Similarity?

The literature offers several node similarity measures based on the graph structure, random walks, etc. What is their accuracy? Can they scale to graphs with billions of nodes? Can they be computed in real time?

The three ingredients of Big Data

A lot of data. A sophisticated infrastructure: MapReduce. Efficient algorithms: graph mining.

MapReduce

The work is spread in parallel across several machines connected by fast links.
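To make the programming model concrete, here is a toy, single-process sketch of the map, shuffle and reduce phases. It is not the actual infrastructure: in a real deployment the map and reduce calls run on different machines. The click log and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical click log: (advertiser, query, clicks).
records = [
    ("advertiser_1", "nike store new york", 10),
    ("advertiser_1", "soccer shoes", 4),
    ("advertiser_2", "soccer ball", 5),
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    advertiser, _query, clicks = record
    yield advertiser, clicks


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each `map_phase` call looks at a single record and each `reduce_phase` call at a single key, both phases can be distributed without coordination beyond the shuffle.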

Algorithms

Personalized PageRank: random walks on the graph. Closely related to the celebrated Google PageRank.

Personalized PageRank

Idea: perform a very long random walk starting from v. Ranking the nodes by their probability of being visited assigns each node a similarity score with respect to v. Strong community bias (this can be formalized).
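The idea can be simulated literally. The sketch below assumes the usual random-walk-with-restart formulation of Personalized PageRank: at each step the walk jumps back to the source with a fixed probability, otherwise it moves to a uniformly random neighbor. The example graph (two triangles joined by one edge), the restart probability of 0.15 and the function name are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical example: two triangles {a, b, c} and {d, e, f} joined by the edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}


def random_walk_scores(graph, source, steps=200_000, restart=0.15, seed=0):
    """Estimate visit probabilities of a random walk with restart at `source`."""
    rng = random.Random(seed)
    visits = Counter()
    current = source
    for _ in range(steps):
        if rng.random() < restart:
            current = source  # jump back to the starting node
        else:
            current = rng.choice(graph[current])  # follow a random edge
        visits[current] += 1
    return {node: visits[node] / steps for node in graph}


scores = random_walk_scores(graph, "a")
ranking = sorted(scores, key=scores.get, reverse=True)
```

On this graph the nodes of the source's own triangle receive higher scores than the nodes of the other triangle, which illustrates the community bias mentioned above.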

Exact computation is infeasible (O(n^3)), but the scores can be approximated very well. A very efficient MapReduce algorithm scales to large graphs (hundreds of millions of nodes). However, a bottleneck remains.
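One standard way to approximate the scores without solving a linear system is power iteration. The sketch below is not the MapReduce algorithm of the talk, only a single-machine illustration; the example graph, the restart probability and the function name are assumptions. Each round costs time proportional to the number of edges.

```python
# Hypothetical example: two triangles {a, b, c} and {d, e, f} joined by the edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}


def personalized_pagerank(graph, source, restart=0.15, iterations=100):
    """Approximate Personalized PageRank by power iteration.

    Assumes every node has at least one neighbor.
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        updated[source] += restart  # probability mass that restarts at the source
        for node, neighbors in graph.items():
            share = (1.0 - restart) * scores[node] / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        scores = updated
    return scores


scores = personalized_pagerank(graph, "a")
ranking = sorted(scores, key=scores.get, reverse=True)
```

The error shrinks by a factor of `1 - restart` per round, so a few dozen rounds are typically enough for a stable ranking.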

Algorithmic Bottleneck

Our graphs are simply too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).
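The exponential blow-up is easy to quantify: k categories have 2^k - 1 non-empty subsets. A minimal sketch follows; the function name and the sample values of k are assumptions chosen for illustration.

```python
def count_category_subsets(num_categories):
    """Number of non-empty subsets of `num_categories` categories."""
    return 2 ** num_categories - 1


# Print how fast the number of subsets grows with the number of categories.
for k in (10, 20, 100):
    print(k, count_category_subsets(k))
```

Already at 20 categories there are over a million subsets, each of which would need its own large-scale ranking computation.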

1st idea: Tackling Real Graph Structure

Data size is the main bottleneck. Compressing the graph would speed up the computation.

[Figure: (1) the graph of advertisers and queries is compressed into a graph with only advertisers; (2) the result computed on the compressed graph yields the ranking of the entire graph.]

Theorem: the ranking computed is the correct Personalized PageRank on the entire graph. Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, 1961; Meyer, 1989, etc.).
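The slides do not spell out how the compression is performed. As an illustration only, the sketch below uses one standard construction: a walk that goes from an advertiser to a query and back to an advertiser is collapsed into a single advertiser-to-advertiser step. This is an assumption, not necessarily the authors' method, and it ignores the restart step of Personalized PageRank. The function name and sample data are hypothetical.

```python
def advertiser_only_graph(advertiser_to_queries):
    """Collapse a bipartite click graph into advertiser-to-advertiser transition probabilities."""
    # Reverse adjacency: query -> {advertiser: clicks}.
    query_to_advertisers = {}
    for advertiser, queries in advertiser_to_queries.items():
        for query, clicks in queries.items():
            query_to_advertisers.setdefault(query, {})[advertiser] = clicks

    transitions = {}
    for advertiser, queries in advertiser_to_queries.items():
        out_total = sum(queries.values())
        row = {}
        for query, clicks in queries.items():
            to_query = clicks / out_total  # advertiser -> query step
            back = query_to_advertisers[query]
            back_total = sum(back.values())
            for other, other_clicks in back.items():
                # query -> advertiser step, weighted by clicks
                row[other] = row.get(other, 0.0) + to_query * other_clicks / back_total
        transitions[advertiser] = row
    return transitions


# Hypothetical data: the two advertisers share the query "q2".
bipartite = {"A": {"q1": 1, "q2": 1}, "B": {"q2": 1, "q3": 1}}
transitions = advertiser_only_graph(bipartite)
```

The compressed graph has one node per advertiser, so its size no longer depends on the number of queries.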


Two-stage Approach

First stage: large-scale (but feasible) MapReduce pre-computation. Second stage: fast iterative algorithm.

First Stage: Individual Category Rankings

[Figure: the graph of advertisers and queries, with one precomputed ranking for each individual category.]

Second Stage: Rank aggregation

A real-time iterative algorithm aggregates the precomputed rankings of a given node for a subset of the categories (in the figure, the Red and Yellow rankings are combined into the ranking of Red + Yellow).


Conclusions

Experimental evaluation shows the accuracy of the results. Fully implemented and currently under evaluation for integration in production systems. Ongoing research project for future scientific publications.
