Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Conference On Java, Pune, India]

Gaurav Kohli

Xebia

Breaking with Relational DBMS and Dating with HBase

About me

Gaurav Kohli
[email protected]

Consultant, Xebia IT Architects

Agenda: Why are we here?

Something about RDBMS

Limitations of RDBMS

Why HBase or any NoSQL solution

Overview of HBase

Specific use cases

Paradigm shift in schema design

Architecture of HBase

HBase interfaces: Java API, Thrift

Conclusion

Relational Databases

Relational databases have a lot of limitations

Limitations

Data sets going into petabytes

RDBMS don't scale inherently: scale up or scale out (load balancing + replication)

Hard to shard / partition

High read and write throughput are not possible at the same time (transactional vs. analytical databases)

Specialized hardware ... is very expensive (e.g. Oracle clustering)

Replication: Master - Slave

[Diagram: a master replicating to a slave]

Scaling Out: Master - Many Slaves

The MySQL master becomes a problem

All slaves must have the same write capacity as the master

Single point of failure, no easy failover

[Diagram: a master and its slave nodes, with reads and writes]

Dual Master

[Diagram: two masters and a slave, linked by replication]

NoSQL

Background

2006.11: Google releases paper on BigTable

2007.2: Initial HBase prototype created as Hadoop contrib

2007.10: First usable HBase

2008.1: Hadoop becomes an Apache top-level project and HBase becomes a subproject

~2010.5: HBase becomes an Apache top-level project

2010.6: HBase 0.20.5 released

2010.10: HBase 0.89.2010092, third developer release

HBase is a

Distributed (uses HDFS for storage)

Column-oriented

Multi-dimensional (versions)

Highly available

High-performance

Storage system

HBase is Not

A SQL database: no joins, no query engine, no datatypes, no SQL

No schema

Denormalized data

Wide and sparsely populated data structure (key-value)

No DBA needed

Use Case

Bigness: big data, big number of users, big number of computers

Massive write performance: Facebook handles 135 billion messages a month, Twitter stores 7 TB of data per day

Fast key-value access

Write availability

No single point of failure

Specific Use Case

Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, etc.

Real-time inserts, updates, and queries.

Fraud detection by comparing transactions to known patterns in real time.

Analytics: use MapReduce, Hive, or Pig to perform analytical queries.

Storage Model

Column-oriented database

Tables are sorted by row key

The table schema only defines column families; a column family can have any number of columns

Each cell value has a timestamp

Storage Model

SortedMap(RowKey, List(SortedMap(Column, List(Value, Timestamp))))

A BIG SORTED MAP

Row Key + Column Key + Timestamp => Value

Row Key  Column Key  Timestamp      Value
1        info:name   1273516197868  Gaurav
1        info:age    1273871824184  28
1        info:age    1273871823022  34
1        info:sex    1273746281432  Male
2        info:name   1273863723227  Harsh
3        info:name   1273822456433  Raman

Row 1 holds two versions of the info:age cell

The timestamp is a long value

In a column key such as info:name, info is the column family and name is the column qualifier (name)

Cells are sorted by row key and column key
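The "big sorted map" idea can be sketched in plain Java with nested TreeMaps. This is only an in-memory illustration of the layout described above, not the HBase client API: the class name SortedCellMap and its methods are hypothetical, and row keys are compared as strings here, whereas HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the storage layout; this is not the HBase client API.
class SortedCellMap {
    // row key -> column key ("family:qualifier") -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    // Stores one version of a cell under (row key, column key, timestamp).
    void put(String rowKey, String columnKey, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(columnKey,
                    c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowKey, String columnKey) {
        NavigableMap<Long, String> versions = versionsOf(rowKey, columnKey);
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Returns how many versions of a cell are stored.
    int versionCount(String rowKey, String columnKey) {
        return versionsOf(rowKey, columnKey).size();
    }

    // Returns all row keys in sorted order, regardless of insertion order.
    List<String> rowKeys() {
        return new ArrayList<String>(rows.keySet());
    }

    private NavigableMap<Long, String> versionsOf(String rowKey, String columnKey) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        NavigableMap<Long, String> versions = (columns == null) ? null : columns.get(columnKey);
        return (versions == null) ? new TreeMap<Long, String>() : versions;
    }
}
```

Loaded with the rows from the table above, getLatest("1", "info:age") returns 28, because 1273871824184 is the newer of the two timestamps stored for that cell.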

Schema Design

Every row has a row key; rows are stored sorted by row key

A table may have one or more column families; it is common to have a small number of them, and they should rarely change

A column family can have any number of columns

Each cell has a timestamp and can have multiple versions

Schema Design: Example of a Student and Subject

Student table: id (PK), name, age, sex

Subject table: id (PK), title, introduction, teacher_id

Student-Subject table: student_id, subject_id, type (m:n relationship between students and subjects)

Schema Design: RDBMS (three tables)

Student table
key  name    age  sex
1    Gaurav  28   Male

Subject table
id  title  introduction   teacher_id
1   HBase  HBase is cool  10

Student-Subject table
student_id  subject_id  type
1           1           elective

Schema Design: HBase (Student-Subject schema, only two tables)

Student table
Row Key     Column Family  Column Keys
student_id  info           name, age, sex
student_id  subjects       subject IDs as qualifiers (keys)

Subject table
Row Key     Column Family  Column Keys
subject_id  info           title, introduction, teacher_id
subject_id  students       student IDs as qualifiers (keys)

Schema Design: HBase (Student-Subject schema, example rows in the two tables)

Student table
key  info                                            subjects
1    info:name=Gaurav, info:age=28, info:sex=Male    subjects:1=elective, subjects:2=main

Subject table
key  info                                                                     students
1    info:title=HBase, info:introduction=HBase is cool, info:teacher_id=10    students:1, students:2
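One way to picture the two-table design is that the m:n relationship is written on both sides instead of being kept in a join table. The sketch below models that with plain Java maps; it is not HBase client code, and the class StudentSubjectSchema, its methods, and the choice of an empty value for the students family are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the two tables; this is not the HBase client API.
class StudentSubjectSchema {
    // row key -> column key ("family:qualifier") -> value
    final Map<String, Map<String, String>> studentTable = new TreeMap<>();
    final Map<String, Map<String, String>> subjectTable = new TreeMap<>();

    // Records the relationship in both tables, replacing the RDBMS join table.
    void enroll(String studentId, String subjectId, String type) {
        row(studentTable, studentId).put("subjects:" + subjectId, type);
        // Assumption: the students family only needs the qualifier, so the value stays empty.
        row(subjectTable, subjectId).put("students:" + studentId, "");
    }

    // Subject ids of one student, read from the qualifiers of the subjects family.
    List<String> subjectsOf(String studentId) {
        return qualifiers(studentTable, studentId, "subjects:");
    }

    // Student ids of one subject, read from the qualifiers of the students family.
    List<String> studentsOf(String subjectId) {
        return qualifiers(subjectTable, subjectId, "students:");
    }

    private static Map<String, String> row(Map<String, Map<String, String>> table, String key) {
        return table.computeIfAbsent(key, k -> new TreeMap<>());
    }

    private static List<String> qualifiers(Map<String, Map<String, String>> table,
                                           String rowKey, String familyPrefix) {
        List<String> result = new ArrayList<>();
        Map<String, String> cells = table.get(rowKey);
        if (cells == null) {
            return result;
        }
        for (String columnKey : cells.keySet()) {
            if (columnKey.startsWith(familyPrefix)) {
                result.add(columnKey.substring(familyPrefix.length()));
            }
        }
        return result;
    }
}
```

The trade-off is that every enrollment costs two writes, but either side of the relationship can then be read from a single row without a join.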

Column family attributes

Attribute    Possible Values         Default
COMPRESSION  NONE, GZ, LZO           NONE
VERSIONS     1+                      3
TTL          1-2147483647 (seconds)  2147483647
BLOCKSIZE    1 byte - 2 GB           64k
IN_MEMORY    true, false             false
BLOCKCACHE   true, false             true
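To make the VERSIONS and TTL attributes more concrete, here is a rough sketch of how they could limit what a reader sees for a single cell. It is a simplified model under stated assumptions: the class VersionedCell is hypothetical, and it filters at read time, whereas a real store would also remove expired or surplus versions physically.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one cell governed by VERSIONS and TTL; not the HBase client API.
class VersionedCell {
    private final int maxVersions;  // plays the role of the VERSIONS attribute
    private final long ttlSeconds;  // plays the role of the TTL attribute
    // timestamp in milliseconds (newest first) -> value
    private final TreeMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.<Long>reverseOrder());

    VersionedCell(int maxVersions, long ttlSeconds) {
        this.maxVersions = maxVersions;
        this.ttlSeconds = ttlSeconds;
    }

    // Stores a value under the given timestamp.
    void put(long timestampMillis, String value) {
        versions.put(timestampMillis, value);
    }

    // Returns the values visible at nowMillis, newest first:
    // at most maxVersions of them, and none older than the TTL.
    List<String> read(long nowMillis) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<Long, String> entry : versions.entrySet()) {
            if (visible.size() == maxVersions) {
                break;
            }
            long ageMillis = nowMillis - entry.getKey();
            if (ageMillis > ttlSeconds * 1000L) {
                break; // entries are newest first, so all remaining ones are older still
            }
            visible.add(entry.getValue());
        }
        return visible;
    }
}
```

With the defaults from the table (VERSIONS = 3, TTL = 2147483647 seconds), a fourth write to the same cell hides the oldest of the four values from readers.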

Regions

Region: a contiguous set of lexicographically sorted rows; its maximum size is set by hbase.hregion.max.filesize (default: 256 MB)

Regions are hosted by Region Servers

Each table is partitioned into regions

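Regions and Splitting

[Diagram: before the split, one region holds row1 to row200 and another holds row201 to row500; a new row arrives]

[Diagram: after the split, the regions hold row1 to row200, row201 to row350, and row351 to row501]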
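The splitting shown in the diagrams can be approximated with a small in-memory sketch: a table is a list of sorted key ranges, and a range that grows too large is cut at its middle key. The class RegionSplitSketch and its row-count threshold are assumptions for illustration only; HBase itself decides on store file size (hbase.hregion.max.filesize), not on the number of rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of region splitting; not how HBase is implemented internally.
class RegionSplitSketch {
    // Assumed threshold: a row count stands in for the file-size limit used by HBase.
    private final int maxRowsPerRegion;
    // Regions in row-key order; each region is a sorted map of row key -> value.
    private final List<TreeMap<String, String>> regions = new ArrayList<>();

    RegionSplitSketch(int maxRowsPerRegion) {
        if (maxRowsPerRegion < 1) {
            throw new IllegalArgumentException("maxRowsPerRegion must be at least 1");
        }
        this.maxRowsPerRegion = maxRowsPerRegion;
        regions.add(new TreeMap<String, String>());
    }

    // Writes a row into the region covering its key and splits that region if it grew too big.
    void put(String rowKey, String value) {
        int index = regionIndexFor(rowKey);
        TreeMap<String, String> region = regions.get(index);
        region.put(rowKey, value);
        if (region.size() > maxRowsPerRegion) {
            split(index);
        }
    }

    int regionCount() {
        return regions.size();
    }

    List<String> rowKeysOf(int regionIndex) {
        return new ArrayList<>(regions.get(regionIndex).keySet());
    }

    // The covering region is the last one whose first key is <= rowKey, else the first region.
    private int regionIndexFor(String rowKey) {
        int index = 0;
        for (int i = 1; i < regions.size(); i++) {
            if (regions.get(i).firstKey().compareTo(rowKey) <= 0) {
                index = i;
            }
        }
        return index;
    }

    // Cuts one region into two contiguous halves at its middle row key.
    private void split(int index) {
        TreeMap<String, String> region = regions.get(index);
        List<String> keys = new ArrayList<>(region.keySet());
        String middleKey = keys.get(keys.size() / 2);
        TreeMap<String, String> lower = new TreeMap<>(region.headMap(middleKey, false));
        TreeMap<String, String> upper = new TreeMap<>(region.tailMap(middleKey, true));
        regions.set(index, lower);
        regions.add(index + 1, upper);
    }
}
```

Because keys are compared lexicographically, a key such as row10 sorts before row2; zero-padded keys avoid that surprise.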

Architecture

Master

ZooKeeper

RegionServers

HDFS

MapReduce

Java API, Thrift... Tools

Java

Thrift (Ruby, PHP, Python, Perl, C++...)

REST

Groovy DSL

MapReduce

HBase Shell

Java API, Thrift... Tools: Java

Get

Put

Delete

Scan

incrementColumnValue

Conclusion

HBase vs. RDBMS: not a replacement

Solves only a small subset of problems (~5%)

Conclusion

Where SQL makes life easy: joining, secondary indexing, referential integrity (updates), ACID

Where HBase makes life easy: dataset scale, read/write scale, replication, batch analysis

References & Credit

HBase at Apache (http://hbase.apache.org/)

HBase Wiki (wiki.apache.org/hadoop/Hbase)

HBase blog (blog.hbase.org)

Images from Google Search

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
