

On Compressing Web Graphs

Michael Mitzenmacher, Harvard

Micah Adler, Univ. of Massachusetts


The Web as a Graph

[Figure: Page A with hyperlinks to Pages B, C, and D, shown alongside the corresponding directed graph: node A with edges to B, C, and D.]


Motivation

• The Web graph itself is interesting and useful.
  – PageRank / Kleinberg’s algorithm.
  – Finding cyber-communities.
  – Archival history of Web growth and development.
  – Connectivity server.

• Storing Web linkage information is expensive.
  – Web growth rate vs. storage growth rate?

• Can we compress it?


Varieties of Compression

1. Compress an isomorphism of the Web graph. Good for storage/transmission of graph features.

2. Compress the Web graph with nodes in a given order (e.g. sorted by URL).

3. Compress for direct use of the compressed graph in a product (e.g. connectivity server).


Baseline: Huffman coding

• Significant work has shown that the in/outdegrees of Web graph vertices follow a power-law distribution.

• Basic scheme: for each vertex, list all outedges.

• Assign Huffman codeword based on indegree.

Pr(indegree = j) ∼ j^(−α)
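
A minimal sketch in Python (not the authors' code) of how the indegree-weighted Huffman codeword lengths can be computed with the standard heap construction; the frequencies here are the indegrees from the example on the next slide:

import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return Huffman codeword lengths for symbols with the given frequencies."""
    # Heap items: (subtree frequency, unique tiebreaker, symbols in subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = Counter()
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # every symbol in them moves one level deeper
        heapq.heappush(heap, (f1 + f2, tie, s1 + s2))
        tie += 1
    return lengths

indegrees = {'v1': 1, 'v2': 3, 'v3': 2, 'v4': 3, 'v5': 1, 'v6': 1, 'v7': 1}
print(huffman_code_lengths(indegrees))
# The high-indegree vertices v2 and v4 get the shortest codewords (length 2).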


Huffman Example

Indegree   Codeword
1          100
3          01
2          001
3          11
1          0000
1          0001
1          101


Web Graph Structure

• Intuition: Huffman uses degree distribution, but not Web graph structure.

• More structure to take advantage of: Web communities.

• Many pages share links.

[Figure: pages A and B both link to C, D, and E, illustrating shared links; a further page F is also shown.]


Reference Algorithm

• Each vertex is allowed to choose a reference vertex.

• Compress by representing edges copied from the reference vertex as a bit vector.

• No cycles allowed among references.

[Figure: X has outedges {a, b, c, d}; Y has outedges {b, c, d, e, f}. X uses Y as its reference: X's outedges are encoded as the extra edge a plus the bit vector [11100] over Y's outedge list.]
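
A minimal sketch of the encoding step (hypothetical helper names, not the prototype's code):

def encode_with_reference(outedges, ref_outedges):
    """Split a vertex's outedges into a bit vector over the reference's
    outedge list plus the leftover (extra) edges."""
    out, ref = set(outedges), set(ref_outedges)
    bits = [1 if e in out else 0 for e in ref_outedges]  # which ref edges are copied
    extras = [e for e in outedges if e not in ref]       # edges stored explicitly
    return bits, extras

def decode_with_reference(bits, extras, ref_outedges):
    return [e for b, e in zip(bits, ref_outedges) if b] + extras

# X = {a, b, c, d} using reference Y = {b, c, d, e, f}, as in the figure above:
print(encode_with_reference(list('abcd'), list('bcdef')))
# -> ([1, 1, 1, 0, 0], ['a'])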


Simple Reference Algorithm

• Maximize the number of edges compressed.

• Build a related affinity graph, recording number of shared pointers.

• Find a maximum spanning tree (or forest) to find best references.

[Figure: X and Y as on the previous slide share three outedge targets (b, c, d), so the affinity graph has an edge of weight 3 between X and Y.]
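
One way to realize this step, sketched in Python under the assumption that the shared-pointer counts are already available: Kruskal's algorithm over the affinity edges in decreasing weight order gives a maximum spanning forest, and each resulting tree can then be rooted to orient the references:

def max_spanning_forest(num_vertices, affinity_edges):
    """affinity_edges: (weight, u, v) triples. Returns the edges of a
    maximum-weight spanning forest."""
    parent = list(range(num_vertices))        # union-find over vertices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    chosen = []
    for w, u, v in sorted(affinity_edges, reverse=True):  # heaviest first
        ru, rv = find(u), find(v)
        if ru != rv:                          # edge joins two components
            parent[ru] = rv
            chosen.append((u, v, w))
    return chosen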


Improved Reference Algorithm

• Let cost(A,B) be the cost of compressing A using B as a reference.

• Form an improved affinity graph: directed graph with costs.

• Also add a root node R, with cost(A,R) being the cost of A with no reference.

• Compute the rooted directed maximum spanning tree on the directed affinity graph.

cost(A, B) = outdeg(B) + ⌈log n⌉ · ( |N(A) − N(B)| + 1 )

where N(v) is the set of outedge targets of v: a bit vector of length outdeg(B), plus ⌈log n⌉ bits for each of A's edges not covered by B, plus ⌈log n⌉ bits to name the reference.
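
In code, this reconstructed cost is direct (illustrative names; the printed numbers match the example on the next slide):

from math import ceil, log2

def ref_cost(n, out_a, out_b):
    """Bits to encode A's outedges using B as reference, n vertices total.
    out_a, out_b are the sets N(A) and N(B) of outedge targets."""
    bits_per_id = ceil(log2(n))
    uncovered = len(out_a - out_b)                # A's edges not on B's list
    return len(out_b) + bits_per_id * (uncovered + 1)

def root_cost(n, out_a):
    """Bits with no reference: simply list A's outedges."""
    return ceil(log2(n)) * len(out_a)

print(ref_cost(1024, set('abcd'), set('bcdef')))  # 25
print(ref_cost(1024, set('bcdef'), set('abcd')))  # 34
print(root_cost(1024, set('abcd')))               # 40
print(root_cost(1024, set('bcdef')))              # 50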


Example

n = 1024 vertices (⌈log n⌉ = 10). A has outedges {a, b, c, d}; B has outedges {b, c, d, e, f}.

Part of the directed affinity graph, with root R:

cost(A, B) = 5 + 10 · (1 + 1) = 25
cost(B, A) = 4 + 10 · (2 + 1) = 34
cost(A, R) = 10 · 4 = 40 (A with no reference)
cost(B, R) = 10 · 5 = 50 (B with no reference)


Complexity

• Finding the directed maximum spanning tree is fast: for x vertices and y edges, running time is O(x log x + y) or O(y log x).

• Compressing is fast given the references.

• Slow part is building the affinity graph.
  – Equivalent to sparse matrix multiplication.
  – If M is the adjacency matrix, the number of shared neighbors is found by computing M·M^T.
  – Sparseness helps, but still potentially very slow.
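
For instance, with SciPy sparse matrices (an illustrative sketch, not the prototype's code), the shared-neighbor counts fall out of the product M·M^T:

import numpy as np
from scipy.sparse import csr_matrix

# Vertices 0 = X, 1 = Y, 2..7 = a..f; row i of M marks the targets of page i.
rows = [0, 0, 0, 0, 1, 1, 1, 1, 1]      # X -> a, b, c, d ; Y -> b, c, d, e, f
cols = [2, 3, 4, 5, 3, 4, 5, 6, 7]
M = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(8, 8))

shared = (M @ M.T).toarray()
print(shared[0, 1])   # 3.0 -- X and Y share three out-neighbors (b, c, d)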


Building the Affinity Graph

• Approach 1: For each pair of vertices a, b, check the edge lists to find common neighbors.
  – Slow, but good with memory.

• Approach 2: For each vertex a, increase a count for each pair b, c of vertices with edges to a.
  – Quicker, but a potential memory hog.
  – Parallelizable.
  – Complexity:

O( Σ_{a ∈ V} indeg(a)² )
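
A sketch of Approach 2 (hypothetical names; the threshold anticipates the testing setup described later): for each vertex a, bump a counter for every pair of its in-neighbors, which is exactly the sum-of-squared-indegrees work bounded above:

from collections import defaultdict
from itertools import combinations

def affinity_counts(in_neighbors, threshold=2):
    """in_neighbors[a] = vertices with an edge to a. Returns a map
    (b, c) -> number of shared out-neighbors, keeping pairs >= threshold."""
    counts = defaultdict(int)
    for a, pointers in in_neighbors.items():
        for b, c in combinations(sorted(pointers), 2):  # each pair pointing at a
            counts[(b, c)] += 1
    return {pair: w for pair, w in counts.items() if w >= threshold}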


Variations

• Huffman code the non-referenced edges.
  – Using non-Huffman weights to find references is no longer optimal.
  – But the Huffman weights are not known until the references have been found.

• Huffman/run length/otherwise encode bit vectors.

• Bound the depth of the reference tree.

• Find multiple references.


Bounded Tree Depth

• For computing on the compressed form of the graph, we do not want a long path of references.

• Potential solution: bound the tree depth from the root.

• Problem: finding the optimal tree of bounded depth is NP-hard.
  – Depth 2 = Facility location problem.

• In practice: use heuristic/approximation algorithms; split the full optimal tree to keep the depth bounded.
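
The splitting heuristic can be sketched as one pass over the reference tree (illustrative code, assuming references are stored as child lists): whenever a vertex would exceed the bound, cut its reference edge and restart it at depth zero:

from collections import deque

def split_to_depth(children, roots, max_depth):
    """children[v] = vertices that reference v; roots = vertices with no
    reference. Returns the vertices whose reference edge gets cut."""
    cut = set()
    queue = deque((r, 0) for r in roots)
    while queue:
        v, depth = queue.popleft()
        for c in children.get(v, []):
            if depth + 1 > max_depth:
                cut.add(c)                # c loses its reference, becomes a root
                queue.append((c, 0))
            else:
                queue.append((c, depth + 1))
    return cut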


Multiple References

• If one reference is good, finding two could be better.

• We show that finding the optimal pair of references, even just to maximize the number of compressed edges, is NP-hard.

• In practice: run the single-reference algorithm multiple times.


Prototype

• Finds references by constructing the directed affinity graph and computing the directed maximum spanning tree.

• Does not output the compressed form; only the size of the compressed form.

• Also computes the Huffman and Reference + Huffman sizes.
  – Size of the Huffman table is not counted.

• Future work: dealing with the bottleneck of computing the affinity graph.


Web Graph Models

• Copy models
  – New pages generated dynamically.
  – Some links are “random”: uniform over all vertices.
  – Some links are copies: choose a page you like at random, and copy some of its links.
  – Richer models include deletions, changing links, inedges at creation.
  – Results in a power-law degree distribution.
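
A minimal generator in the spirit of this model (parameter names and defaults are assumptions, loosely echoing the testing setup described later):

import random

def copy_model_graph(n, copy_prob=0.5, seed_vertices=1024, seed_degree=3):
    """Grow a directed graph; outedges[v] lists v's link targets."""
    outedges = {v: [random.randrange(seed_vertices) for _ in range(seed_degree)]
                for v in range(seed_vertices)}
    for v in range(seed_vertices, n):
        links = [random.randrange(v)]          # one uniformly random link
        template = random.randrange(v)         # page whose links may be copied
        links += [t for t in outedges[template] if random.random() < copy_prob]
        outedges[v] = links
    return outedges

g = copy_model_graph(131072)
print(sum(len(e) for e in g.values()) / len(g))   # average outdegree around 2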


Copy Model

[Figure: a new page is created with one link to a uniformly random vertex plus copies of some of the links of a randomly chosen page X.]


Data for Testing

• Graphs generated using the random copy model.

• TREC8 WT2g data set.

Graph   Nodes     Copy Prob   Pages Copied   Random Links
G1      131,072   0.5         1              1
G2      131,072   0.7         1              1
G3      131,072   0.5         [1,2]          [1,2]
G4      131,072   0.5         [0,4]          [0,4]
TREC    247,428   NA          NA             NA


Testing Details

• Single pass: at most one reference.

• 10 trials for each random graph type.
  – Little variance found.

• Random graphs seeded with 1024 vertices of degree 3.

• Small graphs (G1, G2): edge between vertices in the affinity graph if they share at least 2 edges in the original. Large graphs (G3, G4, TREC): at least 3 shared edges.


Results

Graph   Avg. Deg.   No comp. (Mbits)   Huffman   Reference   Ref + Huff
G1      2.09        4.66               87.75     88.68       81.58
G2      3.25        7.25               83.93     67.49       63.63
G3      5.10        11.36              85.15     69.96       65.35
G4      10.22       22.78              79.47     61.65       54.13
TREC    4.72        21.00              83.31     49.15       46.36

Compressed sizes are given as percentages of the uncompressed size.


Analysis of Results

• Huffman fails to capture significant structure.

• More copying leads to more compression.

• Good compression possible even with only one reference.

• Performs well on “real” Web data.
  – TREC database may not be representative.
  – Significant locality.


Contributions

• We introduce the Reference algorithm, an algorithm designed to compress Web graphs based on structural properties.

• Initial results: Reference algorithm appears very promising, better than Huffman.

• Bounded depth variations may be suitable for on-line computing (connectivity server).

• Hardness results for natural extensions of the Reference algorithm.


Future Work

• Beating the bottleneck: determining the affinity graph.
  – Can we approximate the affinity graph and still compress well?

• More extensive testing.
  – Variations: multiple passes, bounded depth.
  – Graphs: larger artificial and real Web graphs.

• Determining the value of locality and combining locality with a reference-based scheme.