HBaseCon 2012 | Storing and Manipulating Graphs in HBase


DESCRIPTION

Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes such as cached traversal, PageRank, and clustering; and lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world case study.


Storing and Manipulating Graphs in HBase

Dan Lynn

dan@fullcontact.com

@danklynn

CTO & Co-Founder, FullContact

Based in Denver, Colorado

FullContact keeps contact information current and complete: turn partial contacts into full contacts.

Refresher: Graph Theory

[Diagram: a graph; each node is a Vertex, each connection between nodes is an Edge]

Social Networks

[Example graph: Twitter users @danklynn and @xorlev and the tweet “#HBase rocks”, connected by follows, author, and retweeted edges]

Web Links

[Example graph: http://fullcontact.com/blog/ links to http://techstars.com/ via <a href="...">TechStars</a>]

Why should you care?

Vertex influence:
- PageRank
- Social influence
- Network bottlenecks

Identifying communities

Storage Options

neo4j
- Very expressive querying (e.g. Gremlin)
- Transactional
- Data must fit on a single machine :-(

FlockDB
- Scales horizontally
- Very fast
- No multi-hop query support :-(

RDBMS (e.g. MySQL, Postgres, et al.)
- Transactional
- Huge amounts of JOINing :-(

HBase
- Massively scalable
- Data model well-suited to graphs
- Multi-hop querying?

Modeling Techniques

[Running example: a triangle graph on vertices 1, 2, 3, each connected to the other two]

Adjacency Matrix

      1  2  3
  1 [ 0  1  1 ]
  2 [ 1  0  1 ]
  3 [ 1  1  0 ]

- Can use vectorized libraries
- Requires O(n²) memory (n = number of vertices)
- Hard(er) to distribute

Adjacency List

1 → 2, 3
2 → 1, 3
3 → 1, 2

Adjacency List Design in HBase

row key                  “edges” column family
e:dan@fullcontact.com    p:+13039316251= ...    t:danklynn= ...
p:+13039316251           t:danklynn= ...        e:dan@fullcontact.com= ...
t:danklynn               e:dan@fullcontact.com= ...    p:+13039316251= ...
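For illustration, a minimal sketch (not from the deck; the helper name and the "..." cell values are assumptions) of writing one vertex row with typed keys like e:, t:, and p::

java

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

class EdgeRowSketch {
    // Hypothetical helper: vertex keys are typed strings such as
    // "e:<email>", "t:<twitter handle>", "p:<phone number>".
    static byte[] vertexKey(String type, String id) {
        return Bytes.toBytes(type + ":" + id);
    }

    // One row per vertex; each outbound edge is one cell in the
    // "edges" family, qualified by the destination vertex key.
    static Put exampleRow() {
        Put put = new Put(vertexKey("e", "dan@fullcontact.com"));
        put.add(Bytes.toBytes("edges"), vertexKey("t", "danklynn"), Bytes.toBytes("..."));
        put.add(Bytes.toBytes("edges"), vertexKey("p", "+13039316251"), Bytes.toBytes("..."));
        return put;
    }
}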


What to store?

Custom Writables

package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput dataOutput) throws java.io.IOException;
    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}

java

Custom Writables

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}

groovy

Don’t get fancy with byte[]

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}

groovy
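A minimal sketch of what that string-based encoding could look like (a codec class of my own naming, assuming the weight field from the earlier slide):

java

import org.apache.hadoop.hbase.util.Bytes;

class EdgeValueCodec {
    // Encode the weight as text rather than packed binary so cell
    // values stay human-readable in the HBase shell and easy to debug.
    static byte[] toBytes(double weight) {
        return Bytes.toBytes(Double.toString(weight));
    }

    static double fromBytes(byte[] bytes) {
        return Double.parseDouble(Bytes.toString(bytes));
    }
}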

Querying by vertex

def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[], byte[]>
}

Adding edges to a vertex

def put = new Put(vertexKeyBytes)

put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using a TableReducer
context.write(NullWritable.get(), put)
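When running inside a TableReducer, the Put itself is the output value and the key is ignored; TableMapReduceUtil.initTableReducerJob (shown later) wires the job so these Puts land in the table.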

Distributed Traversal / Indexing

[Example graph: e:dan@fullcontact.com, t:danklynn, and p:+13039316251, with p:+13039316251 as the pivot vertex]

- MapReduce over the outbound edges of each vertex
- Emit vertices and edge data grouped by the pivot: the pivot becomes the reduce key, with its neighbors as the “out” and “in” vertices
- The reducer emits a higher-order edge joining the neighbors, e.g. e:dan@fullcontact.com → t:danklynn
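The deck does not show code for this step; the following is a rough sketch of how the pivot pass might look as an HBase MapReduce job (class names, the "edges" family, and the "..." edge values are assumptions):

java

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// For each vertex row, emit the source vertex keyed by each of its
// destinations, so every vertex's neighbors meet at that pivot.
class OutboundEdgeMapper
        extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
            throws IOException, InterruptedException {
        NavigableMap<byte[], byte[]> edges = row.getFamilyMap(Bytes.toBytes("edges"));
        if (edges == null) return;
        for (byte[] destination : edges.keySet()) {
            ctx.write(new ImmutableBytesWritable(destination), rowKey);
        }
    }
}

// All vertices adjacent to the pivot arrive in one reduce call;
// connect each pair with a new higher-order edge in both directions.
class PivotReducer extends TableReducer<ImmutableBytesWritable,
        ImmutableBytesWritable, NullWritable> {
    protected void reduce(ImmutableBytesWritable pivot,
                          Iterable<ImmutableBytesWritable> neighbors, Context ctx)
            throws IOException, InterruptedException {
        List<byte[]> seen = new ArrayList<byte[]>();
        for (ImmutableBytesWritable n : neighbors) {
            byte[] current = n.copyBytes(); // copy: the framework reuses the instance
            for (byte[] other : seen) {
                Put forward = new Put(current);
                forward.add(Bytes.toBytes("edges"), other, Bytes.toBytes("..."));
                ctx.write(NullWritable.get(), forward);

                Put backward = new Put(other);
                backward.add(Bytes.toBytes("edges"), current, Bytes.toBytes("..."));
                ctx.write(NullWritable.get(), backward);
            }
            seen.add(current);
        }
    }
}

Running this pass repeatedly over the graph table produces the iterations shown next.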

Distributed Traversal / Indexing

Iteration 0
Iteration 1
Iteration 2: reuse edges created during previous iterations
Iteration 3: reuse edges created during previous iterations

Because each pass reuses the higher-order edges created by earlier passes, reachable path length roughly doubles per pass: 2^n hops requires only n iterations (e.g. 16-hop paths after 4 iterations).

Tips / Gotchas

Do implement your own comparator

java

public static class Comparator extends WritableComparator {
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // .....
    }
}
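The elided body could, as a baseline, fall back to raw lexicographic byte order (a sketch; FullContact's actual ordering is not shown in the deck):

java

public int compare(byte[] b1, int s1, int l1,
                   byte[] b2, int s2, int l2) {
    // Baseline: compare the serialized keys byte-by-byte without
    // deserializing them; replace with key-aware logic as needed.
    return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
}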

...and register it:

java

static {
    WritableComparator.define(VertexKeyWritable.class,
        new VertexKeyWritable.Comparator());
}

MultiScanTableInputFormat

java

MultiScanTableInputFormat.setTable(conf, "graph");
MultiScanTableInputFormat.addScan(conf, new Scan());
job.setInputFormatClass(MultiScanTableInputFormat.class);

TableMapReduceUtil

TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

java

Elastic MapReduce

Export pipeline: HFiles → SequenceFiles → copy to S3 → Elastic MapReduce → SequenceFiles → HFiles → HBase

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

$ hadoop jar hbase-VERSION.jar completebulkload
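completebulkload takes the directory of HFiles produced by the job and the destination table name as arguments, e.g. (path and table name hypothetical):

$ hadoop jar hbase-VERSION.jar completebulkload /graph/hfiles graph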

Additional Resources

- Google Pregel: BSP-based graph processing system
- Apache Giraph: implementation of Pregel for Hadoop
- MultiScanTableInputFormat: (code to appear on GitHub)
- Apache Mahout: distributed machine learning on Hadoop

Thanks!

dan@fullcontact.com
