

On Compressing Web Graphs

Michael Mitzenmacher, Harvard

Micah Adler, Univ. of Massachusetts


The Web as a Graph

[Figure: Page A with hyperlinks to Pages B, C, and D, shown alongside the corresponding directed graph: node A with edges to B, C, and D.]


Motivation

• The Web graph itself is interesting and useful.
  – PageRank / Kleinberg’s algorithm.
  – Finding cyber-communities.
  – Archival history of Web growth and development.
  – Connectivity server.

• Storing Web linkage information is expensive.
  – Web growth rate vs. storage growth rate?

• Can we compress it?


Varieties of Compression

1. Compress an isomorphism of the Web graph. Good for storage/transmission of graph features.

2. Compress the Web graph with nodes in a given order (e.g. sorted by URL).

3. Compress for direct use of the compressed graph in a product (e.g. connectivity server).


Baseline: Huffman coding

• Significant work has shown that the in/outdegrees of Web graph vertices follow a power-law distribution.

• Basic scheme: for each vertex, list all outedges.

• Assign Huffman codeword based on indegree.

Pr(indegree = j) ∼ j^(−α)
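
A minimal sketch in Python (not the authors' code) of how the indegree-weighted Huffman codeword lengths can be computed with the standard heap construction; the frequencies here are the indegrees from the example on the next slide:

import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return Huffman codeword lengths for symbols with the given frequencies."""
    # Heap items: (subtree frequency, unique tiebreaker, symbols in subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = Counter()
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # every symbol in them moves one level deeper
        heapq.heappush(heap, (f1 + f2, tie, s1 + s2))
        tie += 1
    return lengths

indegrees = {'v1': 1, 'v2': 3, 'v3': 2, 'v4': 3, 'v5': 1, 'v6': 1, 'v7': 1}
print(huffman_code_lengths(indegrees))
# The high-indegree vertices v2 and v4 get the shortest codewords (length 2).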


Huffman Example

Indegree   Codeword
1          100
3          01
2          001
3          11
1          0000
1          0001
1          101


Web Graph Structure

• Intuition: Huffman uses degree distribution, but not Web graph structure.

• More structure to take advantage of: Web communities.

• Many pages share links.

[Figure: pages A and B both link to C, D, and E, illustrating shared links; a further page F is also shown.]


Reference Algorithm

• Each vertex is allowed to choose a reference vertex.

• Compress by representing edges copied from the reference vertex as a bit vector.

• No cycles allowed among references.

[Figure: X has outedges {a, b, c, d}; Y has outedges {b, c, d, e, f}. X uses Y as its reference: X's outedges are encoded as the extra edge a plus the bit vector [11100] over Y's outedge list.]
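
A minimal sketch of the encoding step (hypothetical helper names, not the prototype's code):

def encode_with_reference(outedges, ref_outedges):
    """Split a vertex's outedges into a bit vector over the reference's
    outedge list plus the leftover (extra) edges."""
    out, ref = set(outedges), set(ref_outedges)
    bits = [1 if e in out else 0 for e in ref_outedges]  # which ref edges are copied
    extras = [e for e in outedges if e not in ref]       # edges stored explicitly
    return bits, extras

def decode_with_reference(bits, extras, ref_outedges):
    return [e for b, e in zip(bits, ref_outedges) if b] + extras

# X = {a, b, c, d} using reference Y = {b, c, d, e, f}, as in the figure above:
print(encode_with_reference(list('abcd'), list('bcdef')))
# -> ([1, 1, 1, 0, 0], ['a'])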


Simple Reference Algorithm

• Maximize the number of edges compressed.

• Build a related affinity graph, recording number of shared pointers.

• Find a maximum spanning tree (or forest) to find best references.

[Figure: X and Y as on the previous slide share three outedge targets (b, c, d), so the affinity graph has an edge of weight 3 between X and Y.]
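
One way to realize this step, sketched in Python under the assumption that the shared-pointer counts are already available: Kruskal's algorithm over the affinity edges in decreasing weight order gives a maximum spanning forest, and each resulting tree can then be rooted to orient the references:

def max_spanning_forest(num_vertices, affinity_edges):
    """affinity_edges: (weight, u, v) triples. Returns the edges of a
    maximum-weight spanning forest."""
    parent = list(range(num_vertices))        # union-find over vertices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    chosen = []
    for w, u, v in sorted(affinity_edges, reverse=True):  # heaviest first
        ru, rv = find(u), find(v)
        if ru != rv:                          # edge joins two components
            parent[ru] = rv
            chosen.append((u, v, w))
    return chosen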


Improved Reference Algorithm

• Let cost(A,B) be the cost of compressing A using B as a reference.

• Form an improved affinity graph: directed graph with costs.

• Also add a root node R, with cost(A,R) being the cost of A with no reference.

• Compute the rooted directed maximum spanning tree on the directed affinity graph.

cost(A, B) = outdeg(B) + ⌈log n⌉ · ( |N(A) − N(B)| + 1 )

where N(v) is the set of outedge targets of v: a bit vector of length outdeg(B), plus ⌈log n⌉ bits for each of A's edges not covered by B, plus ⌈log n⌉ bits to name the reference.
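
In code, this reconstructed cost is direct (illustrative names; the printed numbers match the example on the next slide):

from math import ceil, log2

def ref_cost(n, out_a, out_b):
    """Bits to encode A's outedges using B as reference, n vertices total.
    out_a, out_b are the sets N(A) and N(B) of outedge targets."""
    bits_per_id = ceil(log2(n))
    uncovered = len(out_a - out_b)                # A's edges not on B's list
    return len(out_b) + bits_per_id * (uncovered + 1)

def root_cost(n, out_a):
    """Bits with no reference: simply list A's outedges."""
    return ceil(log2(n)) * len(out_a)

print(ref_cost(1024, set('abcd'), set('bcdef')))  # 25
print(ref_cost(1024, set('bcdef'), set('abcd')))  # 34
print(root_cost(1024, set('abcd')))               # 40
print(root_cost(1024, set('bcdef')))              # 50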


Example

n = 1024 vertices (⌈log n⌉ = 10). A has outedges {a, b, c, d}; B has outedges {b, c, d, e, f}.

Part of the directed affinity graph, with root R:

cost(A, B) = 5 + 10 · (1 + 1) = 25
cost(B, A) = 4 + 10 · (2 + 1) = 34
cost(A, R) = 10 · 4 = 40 (A with no reference)
cost(B, R) = 10 · 5 = 50 (B with no reference)


Complexity

• Finding the directed maximum spanning tree is fast: for x vertices and y edges, running time is O(x log x + y) or O(y log x).

• Compressing is fast given the references.

• Slow part is building the affinity graph.
  – Equivalent to sparse matrix multiplication.
  – If M is the adjacency matrix, the number of shared neighbors is found by computing M·M^T.
  – Sparseness helps, but still potentially very slow.
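
For instance, with SciPy sparse matrices (an illustrative sketch, not the prototype's code), the shared-neighbor counts fall out of the product M·M^T:

import numpy as np
from scipy.sparse import csr_matrix

# Vertices 0 = X, 1 = Y, 2..7 = a..f; row i of M marks the targets of page i.
rows = [0, 0, 0, 0, 1, 1, 1, 1, 1]      # X -> a, b, c, d ; Y -> b, c, d, e, f
cols = [2, 3, 4, 5, 3, 4, 5, 6, 7]
M = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(8, 8))

shared = (M @ M.T).toarray()
print(shared[0, 1])   # 3.0 -- X and Y share three out-neighbors (b, c, d)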


Building the Affinity Graph

• Approach 1: For each pair of vertices a, b, check the edge lists to find common neighbors.
  – Slow, but good with memory.

• Approach 2: For each vertex a, increase a count for each pair b, c of vertices with edges to a.
  – Quicker, but a potential memory hog.
  – Parallelizable.
  – Complexity:

O( Σ_{a ∈ V} indeg(a)² )
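
A sketch of Approach 2 (hypothetical names; the threshold anticipates the testing setup described later): for each vertex a, bump a counter for every pair of its in-neighbors, which is exactly the sum-of-squared-indegrees work bounded above:

from collections import defaultdict
from itertools import combinations

def affinity_counts(in_neighbors, threshold=2):
    """in_neighbors[a] = vertices with an edge to a. Returns a map
    (b, c) -> number of shared out-neighbors, keeping pairs >= threshold."""
    counts = defaultdict(int)
    for a, pointers in in_neighbors.items():
        for b, c in combinations(sorted(pointers), 2):  # each pair pointing at a
            counts[(b, c)] += 1
    return {pair: w for pair, w in counts.items() if w >= threshold}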


Variations

• Huffman code the non-referenced edges.
  – Using non-Huffman weights to find references is no longer optimal.
  – But the Huffman weights are not known until the references have been found.

• Huffman/run length/otherwise encode bit vectors.

• Bound the depth of the reference tree.

• Find multiple references.


Bounded Tree Depth

• For computing on the compressed form of the graph, we do not want a long path of references.

• Potential solution: bound the tree depth from the root.

• Problem: finding the optimal tree of bounded depth is NP-hard.
  – Depth 2 = Facility location problem.

• In practice: use heuristic/approximation algorithms; split the full optimal tree to keep the depth bounded.
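
The splitting heuristic can be sketched as one pass over the reference tree (illustrative code, assuming references are stored as child lists): whenever a vertex would exceed the bound, cut its reference edge and restart it at depth zero:

from collections import deque

def split_to_depth(children, roots, max_depth):
    """children[v] = vertices that reference v; roots = vertices with no
    reference. Returns the vertices whose reference edge gets cut."""
    cut = set()
    queue = deque((r, 0) for r in roots)
    while queue:
        v, depth = queue.popleft()
        for c in children.get(v, []):
            if depth + 1 > max_depth:
                cut.add(c)                # c loses its reference, becomes a root
                queue.append((c, 0))
            else:
                queue.append((c, depth + 1))
    return cut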


Multiple References

• If one reference is good, finding two could be better.

• We show that finding the optimal pair of references, even just to maximize the number of compressed edges, is NP-hard.

• In practice: run the single-reference algorithm multiple times.


Prototype

• Finds references by constructing the directed affinity graph and computing the directed maximum spanning tree.

• Does not output the compressed form; only the size of the compressed form.

• Also computes the Huffman and Reference + Huffman sizes.
  – Size of the Huffman table is not counted.

• Future work: dealing with the bottleneck of computing the affinity graph.


Web Graph Models

• Copy models
  – New pages generated dynamically.
  – Some links are “random”: uniform over all vertices.
  – Some links are copies: choose a page you like at random, and copy some of its links.
  – Richer models include deletions, changing links, inedges at creation.
  – Results in a power-law degree distribution.
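
A minimal generator in the spirit of this model (parameter names and defaults are assumptions, loosely echoing the testing setup described later):

import random

def copy_model_graph(n, copy_prob=0.5, seed_vertices=1024, seed_degree=3):
    """Grow a directed graph; outedges[v] lists v's link targets."""
    outedges = {v: [random.randrange(seed_vertices) for _ in range(seed_degree)]
                for v in range(seed_vertices)}
    for v in range(seed_vertices, n):
        links = [random.randrange(v)]          # one uniformly random link
        template = random.randrange(v)         # page whose links may be copied
        links += [t for t in outedges[template] if random.random() < copy_prob]
        outedges[v] = links
    return outedges

g = copy_model_graph(131072)
print(sum(len(e) for e in g.values()) / len(g))   # average outdegree around 2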


Copy Model

[Figure: a new page is created with one link to a uniformly random vertex plus copies of some of the links of a randomly chosen page X.]


Data for Testing

• Graphs generated using the random copy model.

• TREC8 WT2g data set.

Graph   Nodes     Copy Prob   Pages Copied   Random Links
G1      131,072   0.5         1              1
G2      131,072   0.7         1              1
G3      131,072   0.5         [1,2]          [1,2]
G4      131,072   0.5         [0,4]          [0,4]
TREC    247,428   NA          NA             NA


Testing Details

• Single pass: at most one reference.

• 10 trials for each random graph type.
  – Little variance found.

• Random graphs seeded with 1024 vertices of degree 3.

• Small graphs (G1, G2): edge between vertices in the affinity graph if they share at least 2 edges in the original. Large graphs (G3, G4, TREC): at least 3 shared edges.


Results

Graph   Avg. Deg.   No comp. (Mbits)   Huffman   Reference   Ref + Huff
G1      2.09        4.66               87.75     88.68       81.58
G2      3.25        7.25               83.93     67.49       63.63
G3      5.10        11.36              85.15     69.96       65.35
G4      10.22       22.78              79.47     61.65       54.13
TREC    4.72        21.00              83.31     49.15       46.36

Compressed sizes are given as percentages of the uncompressed size.


Analysis of Results

• Huffman fails to capture significant structure.

• More copying leads to more compression.

• Good compression possible even with only one reference.

• Performs well on “real” Web data.
  – TREC database may not be representative.
  – Significant locality.


Contributions

• We introduce the Reference algorithm, an algorithm designed to compress Web graphs based on structural properties.

• Initial results: Reference algorithm appears very promising, better than Huffman.

• Bounded depth variations may be suitable for on-line computing (connectivity server).

• Hardness results for natural extensions of the Reference algorithm.


Future Work

• Beating the bottleneck: determining the affinity graph.
  – Can we approximate the affinity graph and still compress well?

• More extensive testing.
  – Variations: multiple passes, bounded depth.
  – Graphs: larger artificial and real Web graphs.

• Determining the value of locality and combining locality with a reference-based scheme.