GDG DevFest Central Italy 2013
Slide 2
Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.
Slide 3
The AdWords Problem
Slide 4
?
Slide 5
?
Slide 6
Soccer Shoes
Slide 7
The AdWords Problem Soccer Shoes
Slide 8
Google Advertisement in Numbers
Over a billion queries a day.
A lot of advertisers.
www.google.com/competition/howgooglesearchworks.html
Slide 9
Challenges
Several scientific and technological challenges:
How to find the best ads in real time?
How to price each ad?
How to suggest new queries to advertisers?
The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).
Slide 10
Google Advertisement in Numbers
2012 revenues: 46 billion USD.
95% from advertisement: 43 billion USD.
http://investor.google.com/financial/tables.html
Slide 11
Goals of the Project
Mine AdWords data to automatically identify, for each advertiser, its main competitors, and to suggest relevant queries to each advertiser.
Goals:
Useful business information.
Improved advertisement.
More relevant performance benchmarks.
Slide 12
Information Deluge
Large advertisers (e.g. Amazon, Ask.com) compete in several market segments with very different advertisers.
Query | Information
Nike store New York | Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks
Soccer shoes | Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks
Soccer ball | Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks
... millions of other queries ...
Slide 13
Representing the data
How to represent the salient features of the data?
Relationships between advertisers and queries.
Statistics: clicks, costs, etc.
Take into account the categories.
Efficient algorithms.
Slide 14
Graphs: the lingua franca of Big Data
Mathematical objects studied well before the history of computers.
The Königsberg bridges problem (Euler, 1735).
Slide 15
Graphs: the lingua franca of Big Data
Graphs are everywhere!
Social networks, technological networks, natural networks.
Slide 16
Graphs: the lingua franca of Big Data
Formal definition: a set of nodes (A, B, C, D).
Slide 17
Graphs: the lingua franca of Big Data
Formal definition: a set of edges between the nodes (A, B, C, D).
Slide 18
Graphs: the lingua franca of Big Data
Formal definition: the edges between the nodes (A, B, C, D) might have a weight (1, 4, 2, 3).
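The formal definition above (nodes, edges, optional weights) maps directly onto an adjacency structure. The following is a minimal sketch, not code from the talk: the class name `WeightedGraph` is hypothetical, and since the slide does not say which pairs of nodes the weights 1, 4, 2, 3 belong to, the edge placement below is an assumption made for illustration.

```python
from collections import defaultdict


class WeightedGraph:
    """Undirected weighted graph stored as node -> {neighbor: weight}."""

    def __init__(self):
        self.adj = defaultdict(dict)

    def add_edge(self, u, v, weight=1.0):
        # Store the edge in both directions because the graph is undirected.
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def nodes(self):
        return set(self.adj)

    def weighted_degree(self, u):
        # Sum of the weights of all edges touching u.
        return sum(self.adj[u].values())


g = WeightedGraph()
# Assumed placement of the four weights shown on the slide.
g.add_edge("A", "B", 1)
g.add_edge("B", "C", 4)
g.add_edge("C", "D", 2)
g.add_edge("D", "A", 3)
```

A dictionary of dictionaries keeps neighbor lookups cheap and stores only the edges that exist, which matters for sparse graphs such as the advertiser/query graph discussed next.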
Slide 19
AdWords data as a (Bipartite) Graph
A lot of advertisers.
Billions of queries.
Hundreds of labels.
Slide 20
Semi-Formal Problem Definition Advertisers Queries
Slide 21
Semi-Formal Problem Definition A Advertisers Queries
Slide 22
Semi-Formal Problem Definition A Advertisers Queries
Labels:
Slide 23
Semi-Formal Problem Definition A Advertisers Queries
Labels:
Slide 24
Semi-Formal Problem Definition A Advertisers Queries Labels:
Goal: Find the nodes most similar to A.
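To make the problem definition concrete, here is a small sketch of the bipartite advertiser/query graph restricted to a chosen set of labels. Everything in it is an illustrative assumption: the advertiser names, the click counts and the helper names `bipartite_graph` and `co_bidders` do not come from the talk, and "sharing a query" is only a crude stand-in for the similarity measure discussed later.

```python
from collections import defaultdict

# (advertiser, query, label, clicks); all rows are made-up illustrations.
EDGES = [
    ("adv_A", "soccer shoes", "Apparel", 4),
    ("adv_A", "soccer ball", "Equipment", 5),
    ("adv_B", "soccer shoes", "Apparel", 7),
    ("adv_C", "soccer ball", "Equipment", 2),
    ("adv_C", "nike store new york", "Retailer", 10),
]


def bipartite_graph(edges, labels):
    """Keep only edges whose label is selected; return both adjacency maps."""
    adv_to_query = defaultdict(dict)
    query_to_adv = defaultdict(dict)
    for adv, query, label, clicks in edges:
        if label in labels:
            adv_to_query[adv][query] = clicks
            query_to_adv[query][adv] = clicks
    return adv_to_query, query_to_adv


def co_bidders(adv, adv_to_query, query_to_adv):
    """Advertisers reachable from `adv` in two hops, through a shared query."""
    found = set()
    for query in adv_to_query.get(adv, {}):
        found.update(query_to_adv[query])
    found.discard(adv)
    return found
```

Selecting a different subset of labels changes the graph, and therefore the answer, which is why the choice of labels is part of the problem statement.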
Slide 25
How to Define Similarity?
Several node similarity measures exist in the literature, based on the graph structure, random walks, etc.
What is the accuracy?
Can it scale to graphs with billions of nodes?
Can it be computed in real time?
Slide 26
The three ingredients of Big Data
A lot of data.
A sophisticated infrastructure: MapReduce.
Efficient algorithms: graph mining.
Slide 27
MapReduce
Slide 28
The work is spread across several machines that run in parallel and are
connected by fast links.
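The MapReduce programming model can be sketched in a single process: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function combines each group. The toy below is an assumption-laden illustration, not the production system: the function names and the records are hypothetical, and a real deployment would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (advertiser, clicks) for one (advertiser, query, clicks) row."""
    advertiser, _query, clicks = record
    yield advertiser, clicks


def reduce_key(key, values):
    """Reduce step: total clicks for one advertiser."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process simulation of map, shuffle (group by key) and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up input rows.
records = [
    ("adv_A", "soccer shoes", 4),
    ("adv_A", "soccer ball", 5),
    ("adv_B", "soccer shoes", 7),
]
totals = run_mapreduce(records, map_record, reduce_key)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be split across machines without coordination beyond the shuffle.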
Slide 29
Algorithms
Personalized PageRank: random walks on the graph.
Closely related to the celebrated Google PageRank.
Slide 30
Personalized PageRank
Slide 43
Idea: perform a very long random walk (starting from v). Ranking
nodes by probability of visit assigns a similarity score to each
node with respect to node v. Strong community bias (this can be
formalized).
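The visit probabilities of such a walk can be approximated without simulating it, by repeatedly pushing probability mass along the edges. The sketch below uses the common formulation of Personalized PageRank in which the walk jumps back to v with a fixed probability at every step. The restart probability `alpha=0.15`, the iteration count and the toy graph are assumptions for illustration, and the code assumes every node has at least one edge.

```python
def personalized_pagerank(adj, source, alpha=0.15, iterations=100):
    """Power iteration for a random walk that restarts at `source`.

    adj maps node -> {neighbor: weight}; every node must have an edge.
    """
    rank = {u: 0.0 for u in adj}
    rank[source] = 1.0
    for _ in range(iterations):
        nxt = {u: 0.0 for u in adj}
        nxt[source] = alpha  # mass that restarts at the source
        for u, neighbors in adj.items():
            total = sum(neighbors.values())
            for w, weight in neighbors.items():
                # The remaining mass moves to neighbors in proportion to weight.
                nxt[w] += (1 - alpha) * rank[u] * weight / total
        rank = nxt
    return rank


# Two triangles {a, b, c} and {d, e, f} joined by the single edge c-d.
toy = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1},
    "e": {"d": 1, "f": 1},
    "f": {"d": 1, "e": 1},
}
scores = personalized_pagerank(toy, "a")
```

On this toy graph every node in the source's triangle scores higher than every node in the other triangle, a small instance of the community bias mentioned above.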
Slide 44
Personalized PageRank
Exact computation is infeasible, O(n^3), but it can be approximated very well.
Very efficient MapReduce algorithm scaling to large graphs (hundreds of millions of nodes).
However...
Slide 45
Algorithmic Bottleneck Our graphs are simply too big (billions
of nodes) even for large-scale systems. MapReduce is not real-time.
We cannot precompute the results for all subsets of categories
(exponential time!).
Slide 46
1st idea: Tackling Real Graph Structure
Data size is the main bottleneck.
Compressing the graph would speed up the computation.
Slide 47
1st idea: Tackling Real Graph Structure
[Figure: step 1, from the graph of advertisers and queries (A, B and a to g) to a graph of only advertisers (A, B).]
Slide 48
1st idea: Tackling Real Graph Structure
[Figure: step 1, from the graph of advertisers and queries (A, B and a to g) to a graph of only advertisers (A, B); step 2, from the advertisers-only graph to the ranking of the entire graph.]
Slide 49
1st idea: Tackling Real Graph Structure
Theorem: the ranking computed is the corrected Personalized PageRank on the entire graph.
Based on results from the mathematical theory of Markov chain
state aggregation (Simon and Ando, '61; Meyer, '89, etc.).
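One way to see why a ranking on an advertisers-only graph can recover the ranking on the entire graph is the following toy check. It is an illustration under stated assumptions, not the construction from the talk: the walk restarts at an advertiser with probability `alpha`, the compressed graph uses two-step transitions (advertiser to query to advertiser) with restart probability `beta = 1 - (1 - alpha)**2`, and the result is rescaled by `alpha / beta`. All names and click counts are made up.

```python
def normalize(weights):
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def ppr(transition, source, restart, iterations=200):
    """Power iteration; transition maps node -> {next_node: probability}."""
    rank = {u: 0.0 for u in transition}
    rank[source] = 1.0
    for _ in range(iterations):
        nxt = {u: 0.0 for u in transition}
        nxt[source] = restart
        for u, row in transition.items():
            for w, p in row.items():
                nxt[w] += (1 - restart) * rank[u] * p
        rank = nxt
    return rank


# Made-up click counts: advertiser -> {query: clicks}.
clicks = {
    "adv_A": {"soccer shoes": 4, "soccer ball": 5},
    "adv_B": {"soccer shoes": 7},
    "adv_C": {"soccer ball": 2, "running shoes": 3},
}
adv_step = {a: normalize(qs) for a, qs in clicks.items()}
by_query = {}
for a, qs in clicks.items():
    for q, w in qs.items():
        by_query.setdefault(q, {})[a] = w
query_step = {q: normalize(ads) for q, ads in by_query.items()}

alpha = 0.15
# Reference: Personalized PageRank on the entire bipartite graph.
full_rank = ppr({**adv_step, **query_step}, "adv_A", alpha)

# Compressed graph: advertiser -> advertiser through any shared query.
two_step = {}
for a, row in adv_step.items():
    out = {}
    for q, p in row.items():
        for b, r in query_step[q].items():
            out[b] = out.get(b, 0.0) + p * r
    two_step[a] = out

beta = 1 - (1 - alpha) ** 2
small_rank = ppr(two_step, "adv_A", beta)
adv_scores = {a: s * alpha / beta for a, s in small_rank.items()}
# Query scores follow from one more step out of the advertisers.
query_scores = {
    q: (1 - alpha) * sum(adv_scores[a] * adv_step[a].get(q, 0.0) for a in clicks)
    for q in query_step
}
```

In this sketch the iteration runs over three advertiser nodes instead of six nodes, and the scores of both advertisers and queries on the entire graph are then recovered by a rescaling and one extra step.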
Slide 50
Algorithmic Bottleneck Our graphs are too big (billions of
nodes) even for large-scale systems. MapReduce is not real-time. We
cannot precompute the results for all subsets of categories
(exponential time!).
Slide 51
Two-stage Approach
First stage: large-scale (but feasible) MapReduce pre-computation.
Second stage: fast iterative algorithm.
Slide 52
First Stage: Individual Category Rankings Advertisers
Queries
Slide 53
First Stage: Individual Category Rankings Advertisers Queries
Precomputed Rankings
Second Stage: Rank aggregation
[Figure: two sets of precomputed rankings combined into the ranking of Red + Yellow.]
A real-time iterative algorithm aggregates the rankings of a given node for a subset of the categories.
Slide 57
Algorithmic Bottleneck Our graphs are too big (billions of
nodes) even for large-scale systems. MapReduce is not real-time. We
cannot precompute the results for all subsets of categories
(exponential time!).
Slide 58
Conclusions
Experimental evaluation shows the accuracy of the results.
Fully implemented and currently under evaluation for integration in production systems.
Ongoing research project for future scientific publications.