HBaseCon 2012 | Storing and Manipulating Graphs in HBase


DESCRIPTION

Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes such as cached traversal, PageRank, and clustering; and lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world case study.


Storing and Manipulating Graphs in HBase

Dan Lynn

dan@fullcontact.com

@danklynn

CTO & Co-Founder, FullContact

Based in Denver, Colorado

FullContact keeps contact information current and complete: turn partial contacts into full contacts.

Refresher: Graph Theory

[Diagram: a graph; each node is a Vertex, each connection between nodes is an Edge]

Social Networks

[Example graph: Twitter users @danklynn and @xorlev and the tweet “#HBase rocks”, connected by follows, author, and retweeted edges]

Web Links

[Example graph: http://fullcontact.com/blog/ links to http://techstars.com/ via <a href="...">TechStars</a>]

Why should you care?

Vertex influence:
- PageRank
- Social influence
- Network bottlenecks

Identifying communities

Storage Options

neo4j
- Very expressive querying (e.g. Gremlin)
- Transactional
- Data must fit on a single machine :-(

FlockDB
- Scales horizontally
- Very fast
- No multi-hop query support :-(

RDBMS (e.g. MySQL, Postgres, et al.)
- Transactional
- Huge amounts of JOINing :-(

HBase
- Massively scalable
- Data model well-suited to graphs
- Multi-hop querying?

Modeling Techniques

[Running example: a triangle graph on vertices 1, 2, 3, each connected to the other two]

Adjacency Matrix

      1  2  3
  1 [ 0  1  1 ]
  2 [ 1  0  1 ]
  3 [ 1  1  0 ]

- Can use vectorized libraries
- Requires O(n²) memory (n = number of vertices)
- Hard(er) to distribute

Adjacency List

1 → 2, 3
2 → 1, 3
3 → 1, 2

Adjacency List Design in HBase

row key                  “edges” column family
e:dan@fullcontact.com    p:+13039316251= ...    t:danklynn= ...
p:+13039316251           t:danklynn= ...        e:dan@fullcontact.com= ...
t:danklynn               e:dan@fullcontact.com= ...    p:+13039316251= ...
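For illustration, a minimal sketch (not from the deck; the helper name and the "..." cell values are assumptions) of writing one vertex row with typed keys like e:, t:, and p::

java

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

class EdgeRowSketch {
    // Hypothetical helper: vertex keys are typed strings such as
    // "e:<email>", "t:<twitter handle>", "p:<phone number>".
    static byte[] vertexKey(String type, String id) {
        return Bytes.toBytes(type + ":" + id);
    }

    // One row per vertex; each outbound edge is one cell in the
    // "edges" family, qualified by the destination vertex key.
    static Put exampleRow() {
        Put put = new Put(vertexKey("e", "dan@fullcontact.com"));
        put.add(Bytes.toBytes("edges"), vertexKey("t", "danklynn"), Bytes.toBytes("..."));
        put.add(Bytes.toBytes("edges"), vertexKey("p", "+13039316251"), Bytes.toBytes("..."));
        return put;
    }
}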


What to store?

Custom Writables

package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput dataOutput) throws java.io.IOException;
    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}

java

Custom Writables

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}

groovy

Don’t get fancy with byte[]

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}

groovy
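A minimal sketch of what that string-based encoding could look like (a codec class of my own naming, assuming the weight field from the earlier slide):

java

import org.apache.hadoop.hbase.util.Bytes;

class EdgeValueCodec {
    // Encode the weight as text rather than packed binary so cell
    // values stay human-readable in the HBase shell and easy to debug.
    static byte[] toBytes(double weight) {
        return Bytes.toBytes(Double.toString(weight));
    }

    static double fromBytes(byte[] bytes) {
        return Double.parseDouble(Bytes.toString(bytes));
    }
}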

Querying by vertex

def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[], byte[]>
}

Adding edges to a vertex

def put = new Put(vertexKeyBytes)

put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using a TableReducer
context.write(NullWritable.get(), put)
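When running inside a TableReducer, the Put itself is the output value and the key is ignored; TableMapReduceUtil.initTableReducerJob (shown later) wires the job so these Puts land in the table.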

Distributed Traversal / Indexing

[Example graph: e:dan@fullcontact.com, t:danklynn, and p:+13039316251, with p:+13039316251 as the pivot vertex]

- MapReduce over the outbound edges of each vertex
- Emit vertices and edge data grouped by the pivot: the pivot becomes the reduce key, with its neighbors as the “out” and “in” vertices
- The reducer emits a higher-order edge joining the neighbors, e.g. e:dan@fullcontact.com → t:danklynn
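The deck does not show code for this step; the following is a rough sketch of how the pivot pass might look as an HBase MapReduce job (class names, the "edges" family, and the "..." edge values are assumptions):

java

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// For each vertex row, emit the source vertex keyed by each of its
// destinations, so every vertex's neighbors meet at that pivot.
class OutboundEdgeMapper
        extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
            throws IOException, InterruptedException {
        NavigableMap<byte[], byte[]> edges = row.getFamilyMap(Bytes.toBytes("edges"));
        if (edges == null) return;
        for (byte[] destination : edges.keySet()) {
            ctx.write(new ImmutableBytesWritable(destination), rowKey);
        }
    }
}

// All vertices adjacent to the pivot arrive in one reduce call;
// connect each pair with a new higher-order edge in both directions.
class PivotReducer extends TableReducer<ImmutableBytesWritable,
        ImmutableBytesWritable, NullWritable> {
    protected void reduce(ImmutableBytesWritable pivot,
                          Iterable<ImmutableBytesWritable> neighbors, Context ctx)
            throws IOException, InterruptedException {
        List<byte[]> seen = new ArrayList<byte[]>();
        for (ImmutableBytesWritable n : neighbors) {
            byte[] current = n.copyBytes(); // copy: the framework reuses the instance
            for (byte[] other : seen) {
                Put forward = new Put(current);
                forward.add(Bytes.toBytes("edges"), other, Bytes.toBytes("..."));
                ctx.write(NullWritable.get(), forward);

                Put backward = new Put(other);
                backward.add(Bytes.toBytes("edges"), current, Bytes.toBytes("..."));
                ctx.write(NullWritable.get(), backward);
            }
            seen.add(current);
        }
    }
}

Running this pass repeatedly over the graph table produces the iterations shown next.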

Distributed Traversal / Indexing

Iteration 0
Iteration 1
Iteration 2: reuse edges created during previous iterations
Iteration 3: reuse edges created during previous iterations

Because each pass reuses the higher-order edges created by earlier passes, reachable path length roughly doubles per pass: 2^n hops requires only n iterations (e.g. 16-hop paths after 4 iterations).

Tips / Gotchas

Do implement your own comparator

java

public static class Comparator extends WritableComparator {
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // .....
    }
}
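The elided body could, as a baseline, fall back to raw lexicographic byte order (a sketch; FullContact's actual ordering is not shown in the deck):

java

public int compare(byte[] b1, int s1, int l1,
                   byte[] b2, int s2, int l2) {
    // Baseline: compare the serialized keys byte-by-byte without
    // deserializing them; replace with key-aware logic as needed.
    return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
}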

...and register it:

java

static {
    WritableComparator.define(VertexKeyWritable.class,
        new VertexKeyWritable.Comparator());
}

MultiScanTableInputFormat

java

MultiScanTableInputFormat.setTable(conf, "graph");
MultiScanTableInputFormat.addScan(conf, new Scan());
job.setInputFormatClass(MultiScanTableInputFormat.class);

TableMapReduceUtil

TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

java

Elastic MapReduce

Export pipeline: HFiles → SequenceFiles → copy to S3 → Elastic MapReduce → SequenceFiles → HFiles → HBase

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

$ hadoop jar hbase-VERSION.jar completebulkload
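completebulkload takes the directory of HFiles produced by the job and the destination table name as arguments, e.g. (path and table name hypothetical):

$ hadoop jar hbase-VERSION.jar completebulkload /graph/hfiles graph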

Additional Resources

- Google Pregel: BSP-based graph processing system
- Apache Giraph: implementation of Pregel for Hadoop
- MultiScanTableInputFormat: (code to appear on GitHub)
- Apache Mahout: distributed machine learning on Hadoop

Thanks!

dan@fullcontact.com
