
Cloud Deployments with Apache Hadoop and Apache HBase


Page 1: Cloud Deployments with Apache Hadoop and Apache HBase

Cloud Deployments with Apache Hadoop and Apache HBase

8/23/11, NoSQLNow! 2011 Conference

Jonathan Hsieh
jon@cloudera.com / @jmhsieh

Page 2: Cloud Deployments with Apache Hadoop and Apache HBase

Who Am I?

• Cloudera:
  • Software Engineer on the Platform Team
  • Apache HBase contributor
  • Apache Flume (incubating) founder/committer
  • Apache Sqoop (incubating) committer

• U of Washington:
  • Research in Distributed Systems and Programming Languages

Page 3: Cloud Deployments with Apache Hadoop and Apache HBase

Who is Cloudera? 

Cloudera, the leader in Apache Hadoop‐based software and services, enables enterprises to easily derive business value from all their data.

Page 4: Cloud Deployments with Apache Hadoop and Apache HBase


Page 5: Cloud Deployments with Apache Hadoop and Apache HBase


Page 6: Cloud Deployments with Apache Hadoop and Apache HBase


Page 7: Cloud Deployments with Apache Hadoop and Apache HBase

"Every two days we create as much information as we did from the dawn of civilization up until 2003."
– Eric Schmidt

"I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding."
– Hal Varian (Google's chief economist)

Page 8: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 9: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 10: Cloud Deployments with Apache Hadoop and Apache HBase

What is Apache HBase?

Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.

Page 11: Cloud Deployments with Apache Hadoop and Apache HBase

What is Apache Hadoop?

Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.

Page 12: Cloud Deployments with Apache Hadoop and Apache HBase

Open Source

• Apache 2.0 License
• A community project with committers and contributors from diverse organizations:
  • Cloudera, Facebook, Yahoo!, eBay, StumbleUpon, Trend Micro, …
• The code license means anyone can modify and use the code.

Page 13: Cloud Deployments with Apache Hadoop and Apache HBase

Horizontally Scalable

• Store and access data on 1–1000's of commodity servers.
• Adding more servers should linearly increase performance and capacity:
  • Storage capacity
  • Processing capacity
  • Input/output operations

[Chart: performance/storage/throughput (IOPs) rising linearly with # of servers]

Page 14: Cloud Deployments with Apache Hadoop and Apache HBase

Commodity Servers (circa 2010)

• 2 quad‐core CPUs, running at least 2–2.5GHz
• 16–24GB of RAM (24–32GB if you're considering HBase)
• 4x 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
• Gigabit Ethernet
• $5k–10k / machine

Page 15: Cloud Deployments with Apache Hadoop and Apache HBase

Let's deploy some machines. (In the cloud!)

Page 16: Cloud Deployments with Apache Hadoop and Apache HBase

We'll use Apache Whirr

Apache Whirr is a set of tools and libraries for deploying clusters on cloud services in a cloud‐neutral way.

Page 17: Cloud Deployments with Apache Hadoop and Apache HBase

Ready?

export AWS_ACCESS_KEY_ID=...

export AWS_SECRET_ACCESS_KEY=...

curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.5.0-incubating/whirr-0.5.0-incubating.tar.gz

tar zxf whirr-0.5.0-incubating.tar.gz

cd whirr-0.5.0-incubating

bin/whirr launch-cluster --config recipes/hbase-ec2.properties
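For context, a Whirr recipe is just a Java properties file describing the cluster. The sketch below is illustrative, not the exact recipe shipped with Whirr 0.5.0: the property keys are real Whirr settings, but the cluster name, role counts, and hardware id are assumptions.

# Hypothetical recipes/hbase-ec2.properties (illustrative values)
whirr.cluster-name=hbase-demo
# One master node and five workers, described by role:
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge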

Page 18: Cloud Deployments with Apache Hadoop and Apache HBase

Done. That's it.

You can go home now.

Ok, ok, we'll come back to this later.

Page 19: Cloud Deployments with Apache Hadoop and Apache HBase

"The future is already here — it's just not very evenly distributed."
– William Gibson

Page 20: Cloud Deployments with Apache Hadoop and Apache HBase


Page 21: Cloud Deployments with Apache Hadoop and Apache HBase

Building a web index

• Download all of the web.
• Store all of it.
• Analyze all of it to build the index and rankings.
• Repeat.

Page 22: Cloud Deployments with Apache Hadoop and Apache HBase

Size of Google's Web Index

• Let's assume 50KB per webpage, 500 bytes per URL.
• According to Google*:
  • 1998: 26M indexed pages => 1.3TB
  • 2000: 1B indexed pages => 500TB
  • 2008: ~40B indexed pages => 20,000TB
  • 2008: 1T URLs => ~500TB just in URL names!

[Chart: estimated size of Google's web storage in TB, log scale, by year]

* http://googleblog.blogspot.com/2008/07/we‐knew‐web‐was‐big.html

Page 23: Cloud Deployments with Apache Hadoop and Apache HBase

Volume, Variety, and Velocity

• The web has a massive amount of data.
  • How do we store this massive volume?
• Raw web data is diverse, dirty, and semi‐structured.
  • How do we handle all the compute necessary to clean and process this variety of data?
• There is new content being created all the time!
  • How do we keep up with the velocity of new data coming in?

Page 24: Cloud Deployments with Apache Hadoop and Apache HBase

Telltale signs you might need "noSQL"

Page 25: Cloud Deployments with Apache Hadoop and Apache HBase

Did you try scaling vertically?

• Upgrading to a beefier machine could be quick.
  • (Upgrade that m1.large to an m2.4xlarge.)
• This is probably a good idea.
  • Not quite time for HBase.
• What if this isn't enough?

Page 26: Cloud Deployments with Apache Hadoop and Apache HBase

Changed your schema and queries?

• Remove text search queries (LIKE).
  • These are expensive.
• Remove joins.
  • Normalization is more expensive today.
  • Multiple seeks are more expensive than sequential reads/writes.
• Remove foreign keys and encode your own relations.
  • Avoids constraint checks.
• Just put all parts of a query in a single table.
• Lots of full table scans?
  • Good time for Hadoop.
• This might be a good time to consider HBase.

Page 27: Cloud Deployments with Apache Hadoop and Apache HBase

Need to scale reads?

• Use DB replication to make more copies to read from.
• Use memcached.
• Assumes an 80/20 read‐to‐write ratio; this works reasonably well if you can tolerate replication lag.

Page 28: Cloud Deployments with Apache Hadoop and Apache HBase

Need to scale writes?

• Unfortunately, eventually you may need more writes.
• Let's shard and federate the DB.
  • Loses consistency and order of operations.
  • Replication has diminishing returns with more writes.
  • HA operational complexity!
  • Gah!
• This is definitely a good time to consider HBase.

[Chart: performance (IOPs) vs. # of servers]

Page 29: Cloud Deployments with Apache Hadoop and Apache HBase

Wait – we "optimized the DB" by discarding some fundamental SQL/relational database features?

Page 30: Cloud Deployments with Apache Hadoop and Apache HBase

Yup.


Page 31: Cloud Deployments with Apache Hadoop and Apache HBase

Assumption #1: Your workload fits on one machine…

Image: Matthew J. Stinson CC-BY-NC

Page 32: Cloud Deployments with Apache Hadoop and Apache HBase

Massive amounts of storage

• How much data could you collect today?
  • Many companies easily collect 200GB of logs per day.
  • Facebook claims to collect >15TB per day.
• How do you handle this problem?
  • Just keep a few days' worth of data, then /dev/null.
  • Sample your data.
  • Move data to write‐only media (tape).
• If you want to analyze all your data, you are going to need multiple machines.

Page 33: Cloud Deployments with Apache Hadoop and Apache HBase

Assumption #2:  Machines deserve identities...

Image: Laughing Squid CC BY-NC-SA

Page 34: Cloud Deployments with Apache Hadoop and Apache HBase

Interact with a cluster, not a bunch of machines.

Image: Yahoo! Hadoop cluster [OSCON '07]

Page 35: Cloud Deployments with Apache Hadoop and Apache HBase

Assumption #3: Machines can be reliable…

Image: MadMan the Mighty CC BY-NC-SA

Page 36: Cloud Deployments with Apache Hadoop and Apache HBase

Disk failures are very common

• USENIX FAST '07:
  • tl;dr: 2–8% of new disk drives fail per year.
• For a 100‐node cluster with 1200 drives:
  • a drive failure every 15–60 days.

Page 37: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Cluster Deployment
• Cloud Deployment
• Demo
• Conclusions

Page 38: Cloud Deployments with Apache Hadoop and Apache HBase

What is Apache Hadoop?

Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.

Page 39: Cloud Deployments with Apache Hadoop and Apache HBase

What is Apache Hadoop?

Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.

Page 40: Cloud Deployments with Apache Hadoop and Apache HBase

Goal: Separate complicated distributed fault‐tolerance code from application code

[Diagram: Systems Programmers and Statisticians as overlapping circles; the intersection is labeled "Unicorns"]

Page 41: Cloud Deployments with Apache Hadoop and Apache HBase

What did Google do?

• SOSP 2003: the Google File System paper
• OSDI 2004: the MapReduce paper

Page 42: Cloud Deployments with Apache Hadoop and Apache HBase

Origin of Apache Hadoop (2002–2010)

• Open source web crawler project (Nutch) created by Doug Cutting
• Google publishes the GFS and MapReduce papers
• Open source MapReduce & HDFS project created by Doug Cutting
• Hadoop wins the terabyte sort benchmark
• Runs a 4,000‐node Hadoop cluster
• Cloudera releases CDH3 and Cloudera Enterprise; launches SQL support for Hadoop

Page 43: Cloud Deployments with Apache Hadoop and Apache HBase

Apache Hadoop in production today

• Yahoo! Hadoop clusters: >82PB, >40k machines (as of Jun '11)
• Facebook: 15TB new data per day; 1200+ machines, 30PB in one cluster
• Twitter: >1TB per day, ~120 nodes
• Lots of 5–40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research, government)

Page 44: Cloud Deployments with Apache Hadoop and Apache HBase

Case studies: Hadoop World '10

• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• Yale University: MapReduce and Parallel Database Systems
• General Electric: Sentiment Analysis powered by Hadoop
• Facebook: HBase in Production
• Bank of America: The Business of Big Data
• AOL: AOL's Data Layer
• Raytheon: SHARD: Storing and Querying Large‐Scale Data
• StumbleUpon: Mixing Real‐Time and Batch Processing

More info at http://www.cloudera.com/company/press‐center/hadoop‐world‐nyc/

Page 45: Cloud Deployments with Apache Hadoop and Apache HBase

What is HDFS?

• HDFS is a file system.
  • Just stores (a lot of) bytes as files.
  • Distributes storage across many machines and many disks.
  • Reliable: replicates blocks across machines throughout the cluster.
  • Horizontally scalable: just add more machines and disks.
  • Optimized for large sequential writes.
• Features:
  • Unix‐style permissions.
  • Kerberos‐based authentication.

Page 46: Cloud Deployments with Apache Hadoop and Apache HBase

HDFS’s File API

• Directory operations:
  • List files
  • Remove files
  • Copy files
  • Put / get files
• File operations:
  • Open
  • Close
  • Read
  • Write / Append

(A short Java sketch of this API follows.)
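As a concrete illustration, here is a minimal, hedged sketch of driving that API from Java. The NameNode address and paths are hypothetical; FileSystem, Path, and FSDataOutputStream are the standard Hadoop client classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:8020"); // hypothetical address
    FileSystem fs = FileSystem.get(conf);

    // Put: create and write a file.
    FSDataOutputStream out = fs.create(new Path("/logs/today.log"));
    out.writeBytes("hello hdfs\n");
    out.close();

    // Dir: list files.
    for (FileStatus stat : fs.listStatus(new Path("/logs"))) {
      System.out.println(stat.getPath());
    }

    // Remove a file (non-recursive).
    fs.delete(new Path("/logs/today.log"), false);
  }
}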

Page 47: Cloud Deployments with Apache Hadoop and Apache HBase

Ideal use cases

• Great for storage of massive amounts of raw or uncooked data
  • Massive data files
  • All of your logs

Page 48: Cloud Deployments with Apache Hadoop and Apache HBase

Massive scale while tolerating machine failures

Page 49: Cloud Deployments with Apache Hadoop and Apache HBase

Hadoop gives you agility

• Schema on write
  • Traditional DBs require cleaning and applying a schema up front.
  • Great if you can plan your schema well in advance.
• Schema on read
  • HDFS enables you to store all of your raw data.
  • Great if you have ad hoc queries on ad hoc data.
  • If you don't know your schema, you can try new ones.
  • Great if you are exploring schemas and transforming data.

Page 50: Cloud Deployments with Apache Hadoop and Apache HBase

Analyzing data with MapReduce

• Apache Hadoop MapReduce
  • Simplified distributed programming model.
  • Just specify a "map" and a "reduce" function (see the sketch below).
• MapReduce is a batch processing system.
  • Optimized for throughput, not latency.
  • The fastest MR job takes 15+ seconds.
• You are not going to use this to directly serve data for your next web site.
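To make "just specify a map and a reduce" concrete, here is a minimal sketch of the classic word count against the org.apache.hadoop.mapreduce API; job‐setup boilerplate is omitted and class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: emit (word, 1) for every word in the input line.
  public static class WordMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) ctx.write(new Text(word), ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }
}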

Page 51: Cloud Deployments with Apache Hadoop and Apache HBase

Don't like programming Java?

• Apache Pig
  • Higher‐level dataflow language
  • Good for data transformations
  • Generally preferred by programmers
• Apache Hive
  • SQL‐like language for querying data in HDFS
  • Generally preferred by data scientists and business analysts

Page 52: Cloud Deployments with Apache Hadoop and Apache HBase

There is a catch…

• Files are append‐only.
  • No update or random writes.
  • Data is not available for read until the file is flushed.
• Files are ideally large.
  • Enables storage of 10's of petabytes of data.
  • HDFS splits each file into 64MB or 128MB blocks!

Page 53: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 54: Cloud Deployments with Apache Hadoop and Apache HBase

What is Apache HBase?

Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.

Page 55: Cloud Deployments with Apache Hadoop and Apache HBase

What is Apache HBase?

Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.

Page 56: Cloud Deployments with Apache Hadoop and Apache HBase

Inspiration: Google BigTable

• OSDI 2006 paper
• Goal: quick random read/write access to massive amounts of structured data.
  • It was the data store for Google's crawler web table, Orkut, Analytics, Earth, Blogger, …

Page 57: Cloud Deployments with Apache Hadoop and Apache HBase

Sorted Map Datastore

• It really is just a big table!
• Tables consist of rows, each of which has a primary key (row key).
• Each row may have any number of columns.
• Rows are stored in sorted order.

[Diagram: a table of rows keyed 0000000000 through 7777777777, stored in sorted order]

Page 58: Cloud Deployments with Apache Hadoop and Apache HBase

Anatomy of a Row

• Each row has a row key (think primary key).
  • Lexicographically sorted byte[].
  • A timestamp is associated with each value, for keeping multiple versions of data (see the sketch below).
• A row is made up of columns.
  • Each column has a cell.
  • The contents of a cell are untyped byte[]'s.
    • Apps must "know" the types and handle them.
  • Columns are logically like a Map<byte[] column, byte[] value>.
• Row edits are atomic and changes are strongly consistent (replicas are in sync).
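Peeking ahead at the Java client, here is a minimal sketch of reading a cell's multiple timestamped versions (0.90‐era classes; the 'users' table and its data are illustrative assumptions).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");

    Get get = new Get(Bytes.toBytes("tlipcon"));
    get.setMaxVersions(3); // ask for up to 3 timestamped versions per cell
    Result result = table.get(get);

    // Each KeyValue carries (row, column, timestamp, value).
    for (KeyValue kv : result.raw()) {
      System.out.println(kv.getTimestamp() + " -> "
          + Bytes.toString(kv.getValue()));
    }
    table.close();
  }
}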

Page 59: Cloud Deployments with Apache Hadoop and Apache HBase

Sorted Map Datastore (logical view)

Row key (the implicit PRIMARY KEY in RDBMS terms) | Data (column : value), all stored as byte[]
cutting | 'info:height': '9ft', 'info:state': 'CA', 'roles:ASF': 'Director', 'roles:Hadoop': 'Founder'
tlipcon | 'info:height': '5ft7', 'info:state': 'CA', 'roles:Hadoop': 'Committer'@ts=2010, 'roles:Hadoop': 'PMC'@ts=2011, 'roles:Hive': 'Contributor'

• A single cell might have different values at different timestamps.
• Different rows may have different sets of columns (the table is sparse).
• Useful for *‐to‐many mappings.

Page 60: Cloud Deployments with Apache Hadoop and Apache HBase

Apache HBase depends upon HDFS

• Relies on HDFS for data durability and reliability.
• Uses HDFS to store its write‐ahead log (WAL).
  • Needs flush/sync support in HDFS in order to prevent data loss.

Page 61: Cloud Deployments with Apache Hadoop and Apache HBase

HBase in Numbers

• Largest cluster: ~1000 nodes, ~1PB
• Most clusters: 5–20 nodes, 100GB–4TB
• Writes: 1–3ms, 1k–10k writes/sec per node
• Reads: 0–3ms cached, 10–30ms from disk
  • 10–40k reads/second/node from cache
• Cell size: 0–3MB preferred

Page 62: Cloud Deployments with Apache Hadoop and Apache HBase

Access data via an API. There is "noSQL"*

• HBase API (sketch below):
  • get(row)
  • put(row, Map<column, value>)
  • scan(key range, filter)
  • increment(row, columns)
  • … (checkAndPut, delete, etc.)

• *Ok, that's a slight lie.
  • There is work on integrating Apache Hive, a SQL‐like query language, with HBase.
  • This is not optimal; ~5x slower than normal Hive+HDFS.
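A minimal sketch of those calls from the Java client (0.90‐era classes; the 'users' table, 'info' family, and row keys are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");

    // put(row, column -> value)
    Put put = new Put(Bytes.toBytes("cutting"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("state"), Bytes.toBytes("CA"));
    table.put(put);

    // get(row)
    Result row = table.get(new Get(Bytes.toBytes("cutting")));
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("info"), Bytes.toBytes("state"))));

    // scan(key range): rows from "a" (inclusive) to "n" (exclusive)
    Scan scan = new Scan(Bytes.toBytes("a"), Bytes.toBytes("n"));
    ResultScanner results = table.getScanner(scan);
    for (Result r : results) System.out.println(Bytes.toString(r.getRow()));
    results.close();
    table.close();
  }
}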

Page 63: Cloud Deployments with Apache Hadoop and Apache HBase

Cost Transparency

• Goal: predictable latency for random read and write operations.
  • To achieve this, you have to understand some of the physical layout of your datastore.
• Efficiencies are based on locality.
• A few physical concepts help:
  • Column families
  • Regions

Page 64: Cloud Deployments with Apache Hadoop and Apache HBase

Column Families

• Just a set of related columns.
  • Each may have different columns and access patterns.
• Parameters may be set per column family:
  • Block compression (none, gzip, LZO, Snappy)
  • Version retention policies
  • Cache priority
• Improves read performance.
  • CFs are stored separately: access one without wasting IO on the other.
  • Store related data together for better compression.

[Diagram: one logical table split into two per‐column‐family stores over the same row keys]

(A table‐creation sketch follows.)
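Here is a minimal, hedged sketch of declaring per‐family settings with the 0.90‐era admin API; the 'webtable' name and its families are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("webtable");

    // Small, frequently read family: keep a few versions.
    HColumnDescriptor links = new HColumnDescriptor("links");
    links.setMaxVersions(3); // version retention policy
    table.addFamily(links);

    // Bulky raw content: compress its blocks.
    HColumnDescriptor content = new HColumnDescriptor("content");
    content.setCompressionType(Compression.Algorithm.GZ); // block compression
    table.addFamily(content);

    admin.createTable(table);
  }
}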

Page 65: Cloud Deployments with Apache Hadoop and Apache HBase

Sparse Columns

• Provides schema flexibility.
  • Add columns later; no need to transform the entire schema.
  • Use for writing aggregates atomically ("prejoins").
• Improves performance.
  • Null columns don't take space. You don't need to read what is not there.
  • If you have a traditional DB table with lots of nulls, your data will probably fit well.

[Diagram: rows with differing, sparse sets of populated columns]

Page 66: Cloud Deployments with Apache Hadoop and Apache HBase

Regions

• Tables are divided into sets of rows called regions.
• Read and write load are scaled by spreading across many regions.

[Diagram: a sorted table of row keys 0000000000 through 7777777777 split into multiple regions]

Page 67: Cloud Deployments with Apache Hadoop and Apache HBase

Sorted Map Datastore (physical view)

info column family:
Row key | Column key  | Timestamp     | Cell value
cutting | info:height | 1273516197868 | 9ft
cutting | info:state  | 1043871824184 | CA
tlipcon | info:height | 1273878447049 | 5ft7
tlipcon | info:state  | 1273616297446 | CA

roles column family:
Row key | Column key   | Timestamp     | Cell value
cutting | roles:ASF    | 1273871823022 | Director
cutting | roles:Hadoop | 1183746289103 | Founder
tlipcon | roles:Hadoop | 1300062064923 | PMC
tlipcon | roles:Hadoop | 1293388212294 | Committer
tlipcon | roles:Hive   | 1273616297446 | Contributor

Sorted on disk by row key, column key, and descending timestamp. Timestamps are milliseconds since the Unix epoch.

Page 68: Cloud Deployments with Apache Hadoop and Apache HBase

HBase purposely doesn't have everything

• No atomic multi‐row operations
• No global time ordering
• No built‐in SQL query language
• No query optimizer

Page 69: Cloud Deployments with Apache Hadoop and Apache HBase

HBase vs. just HDFS

                       | Plain HDFS/MR                                  | HBase
Write pattern          | Append‐only                                    | Random write, bulk incremental
Read pattern           | Full table scan, partition table scan          | Random read, small range scan, or table scan
Hive (SQL) performance | Very good                                      | 4–5x slower
Structured storage     | Do‐it‐yourself / TSV / SequenceFile / Avro / ? | Sparse column‐family data model
Max data size          | 30+ PB                                         | ~1PB

If you have neither random write nor random read, stick to HDFS!

Page 70: Cloud Deployments with Apache Hadoop and Apache HBase

What if I don't know what my schema should be?

• MR and HBase complement each other.
  • Use HDFS for long sequential writes.
  • Use MR for large batch jobs.
  • Use HBase for random writes and reads.
• Applications need HBase to have data structured in a certain way.
• Save raw data to HDFS and then experiment.
  • MR for data transformation and ETL‐like jobs from raw data.
  • Use bulk import from MR to HBase.

Page 71: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 72: Cloud Deployments with Apache Hadoop and Apache HBase

Apache HBase in Production

• Facebook: Messages
• StumbleUpon: http://su.pr
• Mozilla: Socorro (receives crash reports)
• Yahoo!: web crawl cache
• Twitter: stores users and tweets for analytics
• … many others

Page 73: Cloud Deployments with Apache Hadoop and Apache HBase

High Level Architecture

[Diagram: a PHP application goes through a Thrift/REST gateway, a Java application through the Java client, and MapReduce/Hive/Pig jobs connect directly; all talk to HBase, which stores data in HDFS and coordinates through ZooKeeper]

Page 74: Cloud Deployments with Apache Hadoop and Apache HBase

Data‐centric schema design?

• Entity relationship model.
  • Design the schema in "normalized form".
  • Figure out your queries.
  • The DBA sets primary/foreign keys and indexes once the queries are known.
• Issues:
  • Difficult and expensive to change the schema.
  • Difficult and expensive to add columns.

Page 75: Cloud Deployments with Apache Hadoop and Apache HBase

Query‐centric schema design

• Know your queries, then design your schema.
• Pick row keys to spread region load.
  • Spreading load can increase read and write efficiency.
• Pick column‐family members for better reads.
  • Create these by knowing the fields needed by queries.
  • It's better to have a few than many.
• Notice:
  • App developers optimize the queries, not DBAs.
  • If you've done the relational DB query optimizations, you are mostly there already!

Page 76: Cloud Deployments with Apache Hadoop and Apache HBase

Schema design exercises

• URL shortener
  • Bit.ly, goo.gl, su.pr, etc.
• Web table
  • Google BigTable's example, Yahoo!'s web crawl cache
• Facebook Messages
  • Conversations and inbox search
  • Transition strategies

Page 77: Cloud Deployments with Apache Hadoop and Apache HBase

Url Shortener Service

• Look up the hash, track the click, and forward to the full url.
• Enter a new long url, generate a short url, and store it in the user's mapping.
• Track historical click counts over time.
• Look up all of a user's shortened urls and display them.

Page 78: Cloud Deployments with Apache Hadoop and Apache HBase

Url Shortener schema

• All queries have at least one join.
• Constraints when adding new urls and short urls.
• How do we delete users?

Page 79: Cloud Deployments with Apache Hadoop and Apache HBase

Url Shortener HBase schema

• Most common queries are single gets.
• Use compression settings on content column families.
• Use a composite row key to group all of a user's shortened urls (see the sketch below).
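A minimal sketch of why the composite key pays off: with keys of the (illustrative) form user + "/" + hash, one short range scan retrieves all of a user's urls. The table name and key layout are assumptions, not the deck's exact schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class UserUrlsScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "shorturl");

    // All of user "jon"'s rows sort contiguously: scan ["jon/", "jon0").
    // '0' is the byte right after '/', so "jon0" is an exclusive upper bound.
    Scan scan = new Scan(Bytes.toBytes("jon/"), Bytes.toBytes("jon0"));
    ResultScanner results = table.getScanner(scan);
    for (Result r : results) System.out.println(Bytes.toString(r.getRow()));
    results.close();
    table.close();
  }
}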

Page 80: Cloud Deployments with Apache Hadoop and Apache HBase

Web Tables

• Goal: manage web crawls and their data by keeping snapshots of the web.
  • Google used BigTable for its web table example.
  • Yahoo! uses HBase for its web crawl cache.

[Diagram: full‐scan applications and random‐access applications both hit HBase, which sits on HDFS]

Page 81: Cloud Deployments with Apache Hadoop and Apache HBase

Web Table queries

• The crawler continuously updates links and pages.
• Want to track individual pages over time.
• Want to group related pages from the same site.
• Want to calculate PageRank (links and backlinks).
• Want to build a search index.
• Want to do ad‐hoc analytical queries on page content.

Page 82: Cloud Deployments with Apache Hadoop and Apache HBase

Google web table schema

Page 83: Cloud Deployments with Apache Hadoop and Apache HBase

Web table Schema Design

• Want to keep related pages together.
  • Reverse the url so that related pages are near each other (helper sketch below):
    • archive.cloudera.com => com.cloudera.archive
    • www.cloudera.com => com.cloudera.www
• Want to track pages over time.
  • reverseurl‐crawltimestamp: puts all versions of the same url together.
  • Just scan a localized set of pages.
• Want to calculate PageRank (links and backlinks).
  • Just need the links, so put raw content in a different column family.
  • Avoids having to do IO to read unused raw content.
• Want to index newer pages.
  • Use MapReduce on the most recently crawled content.
• Want to do analytical queries.
  • We'll do a full scan with filters.
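A minimal sketch (an illustrative helper, not from the deck) of the reversed‐hostname row key described above:

public class RowKeys {
  /** archive.cloudera.com -> com.cloudera.archive */
  public static String reverseHostname(String hostname) {
    String[] parts = hostname.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) sb.append('.');
    }
    return sb.toString();
  }
}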

Page 84: Cloud Deployments with Apache Hadoop and Apache HBase

Facebook Messages (as of 12/10)

• 15Bn messages/month (email), 1KB each = 14TB
• 120Bn messages/month (chat), 100 bytes each = 11TB

Queries:
• Create a new message/conversation
• List most recent conversations
• Show the full conversation
• Keyword search of messages

Page 85: Cloud Deployments with Apache Hadoop and Apache HBase

Possible Schema Design

• Show my most recent conversations:
  • Have a "conversation" table using user‐revTimeStamp as the key.
  • Have a metadata column family.
  • Metadata contains the date, to/from, and one line of the most recent message.
• Show me the full conversation:
  • Use the same "conversation" table.
  • The content column family contains a conversation.
  • We already have the full row key from the previous query, so this is just a quick lookup.
• Search my inbox for keywords:
  • Have a separate "inboxSearch" table.
  • Row key design: userId‐word‐messageId‐lastUpdateRevTimestamp
  • Works for type‐ahead / partial messages.
  • Show the top N message ids with the word.
• Send a new message:
  • Update both tables, and both users' rows.
  • Update recent conversations and the keyword index.

Page 86: Cloud Deployments with Apache Hadoop and Apache HBase

Facebook MySQL to HBase transition

• Initially a normalized MySQL email schema sharded over 1000 production servers with 500M users.
• How do we export users' emails?
• Direct approach:
  • A big join over TBs of data (500M users!)
  • This would kill the production servers.
• Incremental approach:
  • Snapshot copy via naïve bulk load into a migration HBase cluster.
  • Incremental fetches from the DB for new live data.
  • Use MR on the migration HBase cluster to do the join, writing to the final cluster.
  • The app writes to both places until the migration is complete.

Page 87: Cloud Deployments with Apache Hadoop and Apache HBase

Row Key tricks

• Row key design for schemas is critical (composite‐key sketch below).
  • Keep a reasonable number of regions.
  • Make sure the key distributes to spread write load.
  • Take advantage of lexicographic sort order.
• Numeric keys and lexicographic sort:
  • Store numbers big‐endian.
  • Pad ASCII numbers with 0's.
• Use reversal to put the most significant traits first.
  • Reverse the URL.
  • Reverse the timestamp to get the most recent first.
• Use composite keys to make keys distribute nicely and work well with sub‐scans.
  • Ex: User‐ReverseTimeStamp
  • Do not use the current timestamp as the first part of the row key!
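A minimal sketch (illustrative, not from the deck) of the User‐ReverseTimeStamp composite key described above:

import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKey {
  /**
   * Row key: user id, then a big-endian reversed timestamp, so the
   * newest rows for a user sort first within that user's key range.
   */
  public static byte[] userRevTimestampKey(String userId, long millis) {
    long reversed = Long.MAX_VALUE - millis; // newer time -> smaller key
    return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(reversed));
  }
}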

Page 88: Cloud Deployments with Apache Hadoop and Apache HBase

Key Take‐aways

• A denormalized schema localizes data for single lookups.
• The row key is critical for lookups and subset scans.
• Make sure row keys being written are distributed.
• Use bulk loads and MapReduce to re‐organize or change your schema (during down time).
• Use multiple clusters for different workloads if you can afford it.

Page 89: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 90: Cloud Deployments with Apache Hadoop and Apache HBase

A Typical Look...

• 5–4000 commodity servers (8‐core, 24GB RAM, 4–12TB, gig‐E)
• 2‐level network architecture
  • 20–40 nodes per rack

Page 91: Cloud Deployments with Apache Hadoop and Apache HBase

Hadoop Cluster nodes

Master nodes (1 each):
• NameNode (metadata server and database)
• JobTracker (scheduler)

Slave nodes (1–4000 each):
• DataNodes (block storage)
• TaskTrackers (task execution)

Page 92: Cloud Deployments with Apache Hadoop and Apache HBase

NameNode and Secondary NameNode

• NameNode
  • The most critical node in the system.
  • Stores file system metadata on disk and in memory.
    • Directory structures, permissions.
    • Modifications stored as an edit log.
  • Fault tolerant, but not highly available yet.
• Secondary NameNode
  • Not a hot standby!
  • Gets a copy of the file system metadata and edit log.
  • Periodically compacts the image and edit log and ships them to the NameNode.
• Make sure your DNS is set up properly!

Page 93: Cloud Deployments with Apache Hadoop and Apache HBase

Data nodes

• HDFS splits files into 64MB (or 128MB) blocks.

• Data nodes store and serves these blocks.• By default, pipeline writes to 3 different machines.• By default, local machine, at machines on other racks.

• Locality helps significantly on subsequent reads and computation schedulingcomputation scheduling.

938/23/11  NoSQLNow! 2011 Conference

Page 94: Cloud Deployments with Apache Hadoop and Apache HBase

JobTracker and TaskTrackers

• Now, we want to process that data!
• JobTracker
  • Schedules work and resource usage throughout the cluster.
  • Makes sure work gets done.
  • Controls retry, speculative execution, etc.
• TaskTrackers
  • These slaves do the "map" and "reduce" work.
  • Co‐located with data nodes.

Page 95: Cloud Deployments with Apache Hadoop and Apache HBase

HBase cluster nodes

Master nodes (1 each) Slave nodes (1‐4000 each)NameNode(HDFS metadata server)

HM tHMaster(region metadata) RegionServer

(table server)DataNode

HMaster(hot standby)

(hdfs block server)

Coordination nodes (odd number)

(hot standby)

ZooKeeper Quorum Peer

958/23/11  NoSQLNow! 2011 Conference

Page 96: Cloud Deployments with Apache Hadoop and Apache HBase

HMaster and ZooKeeper

• HMaster
  • Controls which regions are served by which region servers.
  • Assigns regions to new region servers when servers arrive or go down.
  • Can have a hot standby master in case the main master goes down.
  • All region state is kept in ZooKeeper.
• Apache ZooKeeper
  • Highly available system for coordination.
  • Generally 3 or 5 machines (always an odd number).
  • Uses consensus to guarantee common shared state.
  • Writes are considered expensive.

Page 97: Cloud Deployments with Apache Hadoop and Apache HBase

Region Server

• Tables are chopped up into regions.
  • A region is only served by one region server at a time.
• Regions are served by a "region server".
  • Load is rebalanced if a region server goes down.
• Co‐locate region servers with data nodes.
  • Takes advantage of HDFS file locality.
• It is important that clocks are in reasonable sync. Use NTP!

Page 98: Cloud Deployments with Apache Hadoop and Apache HBase

Stability and Tuning Hints

• Monitor your cluster.
• Avoid memory swapping.
  • Do not oversubscribe memory.
  • Can suffer from cascading failures.
• Mostly scan jobs?
  • Small read cache, low swappiness.
• Use a large max region size for large column families.
  • Avoids costly "region splits".
• Make the ZK timeout higher (config sketch below).
  • Longer to recover from failure, but prevents cascading failure.
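For reference, these knobs live in hbase-site.xml. A hedged sketch with illustrative values, not recommendations; tune for your own cluster:

<!-- hbase-site.xml: illustrative values -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value> <!-- higher ZK timeout, in ms -->
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>4294967296</value> <!-- 4GB max region size, in bytes -->
</property>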

Page 99: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 100: Cloud Deployments with Apache Hadoop and Apache HBase

Back to the cloud!


Page 101: Cloud Deployments with Apache Hadoop and Apache HBase

We'll use Apache Whirr

Apache Whirr is a set of tools and libraries for deploying clusters on cloud services in a cloud‐neutral way.

Page 102: Cloud Deployments with Apache Hadoop and Apache HBase

This is great for setting up a cluster...

jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr launch-cluster --config recipes/hbase-ec2.properties

jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr launch-cluster --config recipes/scm-ec2.properties

Page 103: Cloud Deployments with Apache Hadoop and Apache HBase

...and an easy way to tear down a cluster.

jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr destroy-cluster --config recipes/hbase-ec2.properties

jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr destroy-cluster --config recipes/scm-ec2.properties

Page 104: Cloud Deployments with Apache Hadoop and Apache HBase

But how do you manage a cluster deployment?

Page 105: Cloud Deployments with Apache Hadoop and Apache HBase

Interact with your cluster with Hue

Page 106: Cloud Deployments with Apache Hadoop and Apache HBase

What did we just do?

• Whirr
  • Provisioned the machines on EC2.
  • Installed SCM on all the machines.
• Cloudera Service and Configuration Manager Express
  • Orchestrated the deployment of the services in the proper order:
    • ZooKeeper, Hadoop, HDFS, HBase, and Hue.
  • Set service configurations.
• Free download for kicking off up to 50 nodes!
  http://www.cloudera.com/products‐services/scm‐express/

Page 107: Cloud Deployments with Apache Hadoop and Apache HBase

To Cloud or not to Cloud?

• The key feature of a cloud deployment:
  • Elasticity: the ability to expand and contract the number of machines being used, on demand.
• Things to consider:
  • Economics of machines and people.
    • Capital expenses vs. operational expenses.
  • Workload requirements: performance / SLAs.
  • Previous investments.

Page 108: Cloud Deployments with Apache Hadoop and Apache HBase

Economics of a cluster

EC2 cloud deployment:
• 10 small instances
  • 1 core, 1.7GB RAM, 160GB disk
  • $0.085/hr/machine => $7,446/yr
  • Reserved: $227.50/yr/machine + $0.03/hr/machine => $4,903/yr
• 10 dedicated‐reserved Quad XL instances
  • 8 cores, 23GB RAM, 1690GB disk
  • $6,600/yr/machine + $0.84/hr/machine + $10/hr/region => 66,000 + 73,584 + 87,600 => $227,184/yr

Private cluster:
• 10 commodity servers
  • 8 cores, 24GB RAM, 6TB disk
  • $6,500/machine => $65,000
  • + physical space
  • + networking gear
  • + power
  • + admin costs
  • + more setup time

Page 109: Cloud Deployments with Apache Hadoop and Apache HBase

Pros of using the cloud

• With virtualized machines, you can install any SW you want!
• Great for bursty or occasional loads that expand and shrink.
• Great if your apps and data are already in the cloud.
  • Logs already live in S3, for example.
• Great if you don't have a large ops team.
  • Save money on people dealing with colos and hardware failures.
  • Steadier ops team personnel requirements (unless catastrophic failure).
• Great for experimentation.
• Great for testing/QA clusters at scale.

Page 110: Cloud Deployments with Apache Hadoop and Apache HBase

Cons of using the cloud

• Getting data in and out of EC2.
  • Not the cost, but the amount of time.
  • AWS Direct Connect can help.
• Virtualization causes varying network connectivity.
  • ZooKeeper timeouts can cause cascading failure.
  • Some connections fast, others slow.
  • Dedicated or cluster‐compute instances could improve this.
• Virtualization causes unpredictable IO performance.
  • EBS is like a SAN, and an eventual bottleneck.
  • Ephemeral disks perform better but are not recoverable on failure.
• Still need to deal with disaster recovery.
  • What happens if EC2 or a region goes down? (4/21/11)

Page 111: Cloud Deployments with Apache Hadoop and Apache HBase

Cloudera's Experience with Hadoop in the Cloud

• Some enterprise Hadoop/MR users use the cloud.
• Good for daily jobs with moderate amounts of data (GBs) that are generally computationally expensive.
  • Ex: periodic matching or recommendation applications.
    • Spin up a cluster.
    • Upload a data set to S3.
    • Do an n² matching or recommendation job.
      • The mapper expands the data.
      • The reducer gets a small amount of data back.
      • Write to S3.
    • Download the result set.
    • Tear down the cluster.

Page 112: Cloud Deployments with Apache Hadoop and Apache HBase

Cloudera's Experience with HBase in the Cloud

• Almost all enterprise HBase users use physical hardware.
  • Some initially used the cloud, but transitioned to physical hardware.
• One story:
  • EC2: 40 nodes on EC2 XL instances.
  • Bought 10 physical machines and got similar or better performance.
• Why?
  • Physical hardware gives more control of machine build‐out, network infrastructure, and locality, which are critical for performance.
  • HBase is up all the time and usually grows over time.

Page 113: Cloud Deployments with Apache Hadoop and Apache HBase

Outline

• Motivation
• Enter Apache Hadoop
• Enter Apache HBase
• Real‐World Applications
• System Architecture
• Deployment (in the Cloud)
• Conclusions

Page 114: Cloud Deployments with Apache Hadoop and Apache HBase

Key takeaways

• Apache HBase is not a database! There are other scalable databases.
• Query‐centric schema design, not data‐centric schema design.
• In production at 100's of TB scale at several large enterprises.
• If you are restructuring your SQL DB to optimize it, you may be a candidate for HBase.
• HBase complements and depends upon Hadoop.
• Hadoop makes sense in the cloud for some production workloads.
• HBase in the cloud is for experimentation; production deployments generally run on physical hardware.

Page 115: Cloud Deployments with Apache Hadoop and Apache HBase

HBase vs RDBMS

                             | RDBMS                        | HBase
Data layout                  | Row‐oriented                 | Column‐family‐oriented
Transactions                 | Multi‐row ACID               | Single row only
Query language               | SQL                          | get/put/scan/etc.*
Security                     | Authentication/Authorization | Work in progress
Indexes                      | On arbitrary columns         | Row‐key only*
Max data size                | TBs                          | ~1PB
Read/write throughput limits | 1000s of queries/second      | Millions of "queries"/second

Page 116: Cloud Deployments with Apache Hadoop and Apache HBase

HBase vs other "NoSQL"

• Favors strict consistency over availability (but availability is good in practice!)
• Great Hadoop integration (very efficient bulk loads, MapReduce analysis)
• Ordered range partitions (not hash)
• Automatically shards/scales (just turn on more servers; really proven at petabyte scale)
• Sparse column storage (not key‐value)

Page 117: Cloud Deployments with Apache Hadoop and Apache HBase

HBase vs. just HDFS

                       | Plain HDFS/MR                                  | HBase
Write pattern          | Append‐only                                    | Random write, bulk incremental
Read pattern           | Full table scan, partition table scan          | Random read, small range scan, or table scan
Hive (SQL) performance | Very good                                      | 4–5x slower
Structured storage     | Do‐it‐yourself / TSV / SequenceFile / Avro / ? | Sparse column‐family data model
Max data size          | 30+ PB                                         | ~1PB

If you have neither random write nor random read, stick to HDFS!

Page 118: Cloud Deployments with Apache Hadoop and Apache HBase

More resources?

• Download Hadoop and HBase!
  • CDH (Cloudera's Distribution including Apache Hadoop): http://cloudera.com/
  • http://hadoop.apache.org/
• Try it out! (Locally, VM, or EC2)
• Watch free training videos at http://cloudera.com/

Page 119: Cloud Deployments with Apache Hadoop and Apache HBase

jon@cloudera.com
@jmhsieh

QUESTIONS?
