Slides from the presentation at the Hadoop UK User Group meetup in London, part of BigDataWeek.
Hadoop At
Datasift
About me
Jairam Chandar
Big Data Engineer
Datasift
@jairamc
http://about.me/jairam
http://blog.jairam.me
Outline
What is Datasift?
Where do we use Hadoop?
– The Numbers
– The Use-cases
– The Lessons
!! Sales Pitch Alert !!
What is Datasift?
The Numbers
Machines
– 60 machines
  ● Datanode
  ● Tasktracker
  ● RegionServer
– 2 machines
  ● Namenode
– 2 machines
  ● HBase Master
– In the process of doubling our capacity
The Numbers
Machines
– 2 * Intel Xeon E5620 @ 2.40GHz (16 core total)
– 24GB RAM
– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)
– 1 Gigabit network links
The Numbers
Data
– Avg load of 3500 interactions/second
– Peak load of 6000 interactions/second
– Highest during the Superbowl – 12000 interactions/second
– Avg size of interaction 2 KB – that's 2 TB a day with replication (RF = 3)
– And that's not it!
The Use Cases
HBase
– Recordings
– Archive/Ultrahose
Map/Reduce
– Exports
– Historics
The Use Cases
Recordings
– User defined streams
– Stored in HBase for later retrieval
– Export to multiple output formats and stores
– &lt;recording-id&gt;&lt;interaction-uuid&gt;
  ● Recording-id is a SHA-1 hash
  ● Allows recordings to be distributed by their key without generating hot-spots.
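The row-key scheme above can be sketched in plain Java. This is a hypothetical helper, not Datasift's actual code; the class and method names are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RecordingKey {
    // Build an HBase row key of the form <recording-id><interaction-uuid>.
    // Because the recording-id is a SHA-1 hash, keys spread evenly across
    // the key space instead of piling writes onto a single region.
    static String recordingId(String recordingName) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(recordingName.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available
        }
    }

    static String rowKey(String recordingName, String interactionUuid) {
        return recordingId(recordingName) + interactionUuid;
    }
}
```

Because SHA-1 output is effectively uniform, consecutive recordings land on different regions, which is what avoids the hot-spotting mentioned above.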
The Use Cases
Recordings continued ...
The Use Cases
Exporter
– Export data from HBase for customers
– Export files of 5–10 GB, or 3–6 million records
– MR over HBase using TableInputFormat
– But the data needs to be sorted
  ● TotalOrderPartitioner
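The idea behind Hadoop's TotalOrderPartitioner is to sample the keys and pick split points so that reducer i only sees keys below split point i; concatenating reducer outputs then yields one globally sorted export. A minimal pure-Java sketch of that idea (not the Hadoop class itself):

```java
import java.util.Arrays;

public class TotalOrderSketch {
    private final String[] splitPoints; // numReducers - 1 sorted boundaries

    // Derive split points from a sample of the input keys, so each
    // partition receives a roughly equal, contiguous key range.
    TotalOrderSketch(String[] sampledKeys, int numReducers) {
        String[] sorted = sampledKeys.clone();
        Arrays.sort(sorted);
        splitPoints = new String[numReducers - 1];
        for (int i = 0; i < splitPoints.length; i++) {
            splitPoints[i] = sorted[(i + 1) * sorted.length / numReducers];
        }
    }

    // Route a key to the partition whose range contains it; partitions
    // are in sorted order, so reducer outputs concatenate into one
    // globally sorted file.
    int getPartition(String key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }
}
```

In a real job the sampling is done by `InputSampler` and the boundaries are written to a partition file that `TotalOrderPartitioner` reads; the sketch only shows the routing logic.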
The Use Cases
Exporter Continued
!! Sales Pitch Alert !!
Historics
The Use Cases
Archive/Ultrahose
– Not just the Firehose but the Ultrahose
– Stored in HBase as well
– HBase architecture (BigTable) creates Hotspots with Time Series data
  ● Leading randomizing bit (see HBaseWD)
  ● Pre-split regions
  ● Concurrent writes
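The "leading randomizing bit" trick (as in HBaseWD) prefixes each time-ordered key with a small bucket derived from a hash, so monotonically increasing timestamps fan out across the pre-split regions instead of all landing on the last one. A hedged sketch, where the bucket count and key layout are illustrative assumptions:

```java
public class SaltedKey {
    static final int NUM_BUCKETS = 16; // assumption: matches the pre-split region count

    // Prefix the time-ordered key with a stable bucket id derived from
    // a hash of the interaction id, so sequential writes spread across
    // NUM_BUCKETS regions rather than hot-spotting a single one.
    static String saltedKey(String interactionId, long timestampMillis) {
        int bucket = (interactionId.hashCode() & 0x7fffffff) % NUM_BUCKETS;
        return String.format("%02d-%013d-%s", bucket, timestampMillis, interactionId);
    }
}
```

The trade-off is on reads: a time-range scan must now fan out into one scan per bucket and merge the results, which is exactly what libraries like HBaseWD automate.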
The Use Cases
Archive continued …
2 years of Tweets
– 11 TB compressed
– <Number of tweets we got>
The Use Cases
Historics
– Export archive data
– Slightly different from Exporter
  ● Much larger timelines (1–3 months)
  ● Unfiltered input data
  ● Therefore longer processing time
  ● Hence more optimizations required
The Use Cases
Historics continued ...
The Lessons - HBase
Tune Tune Tune (Default == BAD)
Based on use case, tune:
– Heap
– Block size
– Memstore size
Keep number of column families low
Be aware of hot-spotting issue when writing time-series data
Use compression (e.g. Snappy)
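As a hedged illustration of the kind of tuning meant here (property names from HBase 0.92-era configuration; the values are assumptions, not Datasift's actual settings):

```xml
<!-- hbase-site.xml: illustrative values only; tune per workload -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value> <!-- fraction of heap usable by all memstores -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value> <!-- fraction of heap for the read block cache -->
</property>
```

Heap itself is set via `HBASE_HEAPSIZE` in `hbase-env.sh`, while block size and compression are per-column-family settings, e.g. in the HBase shell (table and family names here are hypothetical): `alter 'archive', {NAME => 'd', COMPRESSION => 'SNAPPY'}`.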
The Lessons - HBase
Ops need intimate understanding of system
Monitor metrics (GC, CPU, Compaction, I/O)
Don't be afraid to fiddle with HBase code
Using a distribution is advisable
Questions?