Transcript
Page 1: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Storing and Manipulating Graphs in HBase

Dan [email protected]

@danklynn

Page 2: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & Co-Founder

Page 3: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Turn Partial Contacts Into Full Contacts

Page 4: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Page 5: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Page 6: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Vertex

Page 7: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Edge

Page 8: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Social Networks

Page 9: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Tweets

@danklynn

@xorlev

“#HBase rocks”

author

follows

retweeted

Page 10: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Web Links

http://fullcontact.com/blog/

http://techstars.com/

<a href=”...”>TechStars</a>

Page 11: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Why should you care?

Vertex Influence

- PageRank

- Social Influence

- Network bottlenecks

Identifying Communities

Page 12: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Storage Options

Page 13: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

neo4j

Page 14: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Very expressive querying (e.g. Gremlin)

neo4j

Page 15: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Transactional

neo4j

Page 16: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Data must fit on a single machine

neo4j

:-(

Page 17: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

FlockDB

Page 18: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Scales horizontally

FlockDB

Page 19: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Very fast

FlockDB

Page 20: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

No multi-hop query support

:-(

FlockDB

Page 21: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

RDBMS (e.g. MySQL, Postgres, et al.)

Page 22: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Transactional

RDBMS

Page 23: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Huge amounts of JOINing

RDBMS

:-(

Page 24: HBaseCon 2012 | Storing and Manipulating Graphs in HBase
Page 25: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Massively scalable

HBase

Page 26: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Data model well-suited

HBase

Page 27: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Multi-hop querying?

HBase

Page 28: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Modeling Techniques

Page 29: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

[Diagram: graph with vertices 1, 2, 3]

Page 30: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

    1  2  3
1   0  1  1
2   1  0  1
3   1  1  0

Page 31: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

Can use vectorized libraries

Page 32: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

Requires O(n²) memory, where n = number of vertices
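To make the memory cost concrete, here is a minimal sketch that stores the three-vertex example graph as a matrix. The class and method names are illustrative assumptions, not code from the talk.

```java
import java.util.Arrays;

final class AdjacencyMatrixSketch {
    // Build an n x n matrix from undirected edges given as 1-based vertex pairs.
    static int[][] build(int n, int[][] edges) {
        int[][] m = new int[n][n];
        for (int[] e : edges) {
            m[e[0] - 1][e[1] - 1] = 1;
            m[e[1] - 1][e[0] - 1] = 1;
        }
        return m;
    }

    // Cells allocated for n vertices, no matter how few edges exist.
    static long cellCount(int n) {
        return (long) n * n;
    }

    public static void main(String[] args) {
        int[][] m = build(3, new int[][] {{1, 2}, {1, 3}, {2, 3}});
        System.out.println(Arrays.deepToString(m));
        System.out.println(cellCount(1_000_000));
    }
}
```

A sparse graph with a million vertices still needs a trillion cells in this layout, which is why the matrix form becomes impractical at scale.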

Page 33: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

Hard(er) to distribute

Page 34: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List

[Diagram: graph with vertices 1, 2, 3]

Page 35: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List

1 2,3

2 1,3

3 1,2
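The same three-vertex graph can be sketched as an adjacency list in plain Java. This is an illustrative in-memory version with assumed names; it only stores entries for edges that exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AdjacencyListSketch {
    // Map each vertex to the list of vertices it shares an edge with.
    static Map<Integer, List<Integer>> build(int[][] edges) {
        Map<Integer, List<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        // Undirected triangle on vertices 1, 2, 3.
        System.out.println(build(new int[][] {{1, 2}, {1, 3}, {2, 3}}));
    }
}
```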

Page 36: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List Design in HBase

e:[email protected]

t:danklynn

p:+13039316251

Page 37: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List Design in HBase

row key              “edges” column family
e:[email protected]  p:+13039316251= ...     t:danklynn= ...
p:+13039316251       t:danklynn= ...         e:[email protected]= ...
t:danklynn           e:[email protected]= ...  p:+13039316251= ...
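The row layout can be imitated in memory with nested sorted maps: row key, then column qualifier, then cell value. This is only a stand-in for an HBase table, and the class name, the method names, and the `e:user@example.com` key are hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;

final class EdgeTableSketch {
    // row key -> (column qualifier in the "edges" family -> cell value)
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Store an undirected edge as two cells, one in each vertex's row.
    void addEdge(String a, String b, String value) {
        rows.computeIfAbsent(a, k -> new TreeMap<>()).put(b, value);
        rows.computeIfAbsent(b, k -> new TreeMap<>()).put(a, value);
    }

    // Reading a single row yields every neighbour of that vertex.
    Set<String> neighbours(String rowKey) {
        TreeMap<String, String> row = rows.get(rowKey);
        return row == null ? Collections.<String>emptySet() : row.keySet();
    }

    public static void main(String[] args) {
        EdgeTableSketch table = new EdgeTableSketch();
        table.addEdge("e:user@example.com", "t:danklynn", "1.0"); // hypothetical e-mail key
        table.addEdge("e:user@example.com", "p:+13039316251", "1.0");
        table.addEdge("t:danklynn", "p:+13039316251", "1.0");
        System.out.println(table.neighbours("t:danklynn"));
    }
}
```

Because the destination vertex is the column qualifier, one row read returns the whole neighbourhood of a vertex.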

Page 39: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Custom Writables

package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput dataOutput) throws java.io.IOException;
    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}

java

Page 40: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Custom Writables

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}

groovy
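A round trip through `write` and `readFields` can be tried without Hadoop on the classpath. The sketch below declares a local interface, `SketchWritable`, that mirrors the shape of `Writable`; that interface, `EdgeValueSketch`, and the helper methods are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in with the same two methods as Hadoop's Writable.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

final class EdgeValueSketch implements SketchWritable {
    double weight;

    EdgeValueSketch() {}

    EdgeValueSketch(double weight) { this.weight = weight; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(weight);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        weight = in.readDouble();
    }

    // Serialize any SketchWritable to a byte array.
    static byte[] serialize(SketchWritable w) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Rebuild an edge value from bytes produced by serialize().
    static EdgeValueSketch deserialize(byte[] bytes) {
        EdgeValueSketch v = new EdgeValueSketch();
        try {
            v.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return v;
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new EdgeValueSketch(0.75));
        System.out.println(bytes.length + " bytes, weight " + deserialize(bytes).weight);
    }
}
```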

Page 41: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Don’t get fancy with byte[]

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}

groovy
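One reading of this advice is to prefer a plain string encoding for cell values over hand-packed binary, so that values stay legible when inspected. The following is a sketch under that assumption; `StringEncodedEdgeValue` is a hypothetical class, not the talk's implementation.

```java
import java.nio.charset.StandardCharsets;

final class StringEncodedEdgeValue {
    final double weight;

    StringEncodedEdgeValue(double weight) { this.weight = weight; }

    // Encode the weight as its decimal string in UTF-8.
    byte[] toBytes() {
        return Double.toString(weight).getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes produced by toBytes().
    static StringEncodedEdgeValue fromBytes(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return new StringEncodedEdgeValue(Double.parseDouble(text));
    }

    public static void main(String[] args) {
        byte[] bytes = new StringEncodedEdgeValue(0.5).toBytes();
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
        System.out.println(fromBytes(bytes).weight);
    }
}
```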

Page 42: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Querying by vertex

def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[],byte[]>
}

Page 43: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adding edges to a vertex

def put = new Put(vertexKeyBytes)

put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using TableReducer
context.write(NullWritable.get(), put)

Page 44: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Page 45: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Page 46: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Pivot vertex

Page 47: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

MapReduce over outbound edges

Page 48: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Emit vertexes and edge data grouped by the pivot

Page 49: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Reduce key

“Out” vertex

“In” vertex

Page 50: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected] t:danklynn

Reducer emits higher-order edge
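The pivot step can be simulated in memory to see which edges come out of it. This sketch groups "in" and "out" vertices by pivot, as a map phase might, and then joins them per pivot, as a reducer might. It is not a MapReduce job, and the class name and the vertex labels (`e:a`, `p:b`, `t:c`) are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class PivotStepSketch {
    // One pass: for every pivot, connect each "in" vertex to each "out" vertex.
    static Map<String, Set<String>> step(Map<String, Set<String>> edges) {
        // "Map" phase: group in-vertices and out-vertices by pivot.
        Map<String, Set<String>> inByPivot = new TreeMap<>();
        Map<String, Set<String>> outByPivot = new TreeMap<>();
        for (Map.Entry<String, Set<String>> row : edges.entrySet()) {
            for (String dest : row.getValue()) {
                // For edge source -> dest: source is an "in" vertex of pivot dest,
                // and dest is an "out" vertex of pivot source.
                inByPivot.computeIfAbsent(dest, k -> new TreeSet<>()).add(row.getKey());
                outByPivot.computeIfAbsent(row.getKey(), k -> new TreeSet<>()).add(dest);
            }
        }
        // Start from a copy of the existing edges.
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> row : edges.entrySet()) {
            result.put(row.getKey(), new TreeSet<>(row.getValue()));
        }
        // "Reduce" phase: per pivot, emit the higher-order edge in -> out.
        for (Map.Entry<String, Set<String>> group : inByPivot.entrySet()) {
            Set<String> outs =
                outByPivot.getOrDefault(group.getKey(), Collections.<String>emptySet());
            for (String in : group.getValue()) {
                for (String out : outs) {
                    if (!in.equals(out)) {
                        result.computeIfAbsent(in, k -> new TreeSet<>()).add(out);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> edges = new TreeMap<>();
        edges.put("e:a", new TreeSet<>(Set.of("p:b"))); // e:a -> p:b
        edges.put("p:b", new TreeSet<>(Set.of("t:c"))); // p:b -> t:c
        System.out.println(step(edges));
    }
}
```

With `p:b` as the pivot, the pass adds the two-hop edge from `e:a` to `t:c`.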

Page 51: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 0

Page 52: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 1

Page 53: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 2

Page 54: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 2

Reuse edges created during previous iterations

Page 55: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 3

Page 56: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 3

Reuse edges created during previous iterations

Page 57: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

… hops requires only … iterations
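The effect of reusing edges from earlier iterations can be observed on a small chain graph. The sketch below is an in-memory simulation with assumed names. The growth it prints is a property of this sketch, since the exact figures on the slide did not survive in the transcript.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class IterationSketch {
    // Directed chain 0 -> 1 -> ... -> n-1.
    static List<Set<Integer>> chain(int n) {
        List<Set<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Set<Integer> out = new TreeSet<>();
            if (i + 1 < n) {
                out.add(i + 1);
            }
            adj.add(out);
        }
        return adj;
    }

    // One iteration: every vertex also gains the edges of its current neighbours,
    // including edges that were created by earlier iterations.
    static List<Set<Integer>> step(List<Set<Integer>> adj) {
        List<Set<Integer>> next = new ArrayList<>();
        for (Set<Integer> out : adj) {
            next.add(new TreeSet<>(out));
        }
        for (int in = 0; in < adj.size(); in++) {
            for (int pivot : adj.get(in)) {
                next.get(in).addAll(adj.get(pivot));
            }
        }
        return next;
    }

    // Farthest vertex that vertex 0 reaches through a single stored edge.
    static int farthestFromZero(List<Set<Integer>> adj) {
        return Collections.max(adj.get(0));
    }

    public static void main(String[] args) {
        List<Set<Integer>> graph = chain(9);
        for (int iteration = 0; iteration <= 3; iteration++) {
            System.out.println("iteration " + iteration
                + ": vertex 0 reaches up to vertex " + farthestFromZero(graph));
            graph = step(graph);
        }
    }
}
```

On this chain the reach of vertex 0 grows from 1 to 2, 4, and then 8 hops, because each pass joins edges that already span several hops.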

Page 58: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Tips / Gotchas

Page 59: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Do implement your own comparator

java

public static class Comparator extends WritableComparator {

    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // .....
    }

}

Page 60: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Do implement your own comparator

java

static {
    WritableComparator.define(VertexKeyWritable.class, new VertexKeyWritable.Comparator());
}
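The body of a raw comparator usually compares the serialized bytes directly, without deserializing the keys. Here is a standalone sketch of one plausible body: an unsigned lexicographic comparison using the same six-argument shape as `compare` above. Hadoop is not required, and the class name and the choice of ordering are assumptions.

```java
import java.nio.charset.StandardCharsets;

final class RawBytesCompareSketch {
    // Compare two byte ranges lexicographically, treating bytes as unsigned.
    // s1/s2 are start offsets and l1/l2 are lengths within b1/b2.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // Equal on the shared prefix: the shorter range sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] x = "e:a".getBytes(StandardCharsets.UTF_8);
        byte[] y = "t:b".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(x, 0, x.length, y, 0, y.length) < 0);
    }
}
```

Working on the raw ranges lets the sort phase order keys without allocating an object per comparison.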

Page 61: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

MultiScanTableInputFormat

MultiScanTableInputFormat.setTable(conf,"graph");

MultiScanTableInputFormat.addScan(conf, new Scan());

job.setInputFormatClass(MultiScanTableInputFormat.class);

java

Page 62: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

TableMapReduceUtil

TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

java

Page 63: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

Page 64: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

Page 65: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles

Copy to S3

Page 66: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

Copy to S3 Elastic MapReduce

Page 67: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

Copy to S3 Elastic MapReduce

Page 68: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

HFiles

Copy to S3 Elastic MapReduce

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

Page 69: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

HFiles HBase

Copy to S3 Elastic MapReduce

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

$ hadoop jar hbase-VERSION.jar completebulkload

Page 70: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Additional Resources

Google Pregel: BSP-based graph processing system

Apache Giraph: Implementation of Pregel for Hadoop

MultiScanTableInputFormat: (code to appear on GitHub)

Apache Mahout: Distributed machine learning on Hadoop

