Data Freeway: Scaling Out to Realtime
Authors: Eric Hwang, Sam Rash {ehwang,rash}@fb.com
Speaker: Haiping Wang [email protected]
Agenda
» Data at Facebook
» Realtime Requirements
» Data Freeway System Overview
» Realtime Components
› Calligraphus/Scribe
› HDFS use case and modifications
› Calligraphus: a ZooKeeper use case
› ptail
› Puma
» Future Work
Big Data, Big Applications / Data at Facebook
» Lots of data
› More than 500 million active users
› 50 million users update their statuses at least once each day
› More than 1 billion photos uploaded each month
› More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
› Data rate: over 7 GB/second
» Numerous products can leverage the data
› Revenue related: Ads Targeting
› Product/User Growth related: AYML, PYMK, etc.
› Engineering/Operation related: Automatic Debugging
› Puma: streaming queries
Example: User related Application
» Major challenges: Scalability, Latency
Realtime Requirements
› Scalability: 10-15 GBytes/second
› Reliability: No single point of failure
› Data loss SLA: 0.01%
• Loss due to hardware: at most 1 out of 10,000 machines can lose data
› Delay of less than 10 sec for 99% of data
• Typically we see 2s
› Easy to use: as simple as 'tail -f /var/log/my-log-file'
Data Freeway System Diagram
» Scribe & Calligraphus get data into the system» HDFS at the core» Ptail provides data out» Puma is a emerging streaming analytics platform
Scribe
• Scalable distributed logging framework
• Very easy to use:
• scribe_log(string category, string message)
• Mechanics:
• Built on top of Thrift
• Runs on every machine at Facebook, collecting log data and delivering it to a set of destinations
• Buffer data on local disk if network is down
• History:
• 2007: Started at Facebook
• 2008 Oct: Open-sourced
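To make the API concrete, here is a minimal Java sketch of a Scribe client. scribe.Client and LogEntry are the classes Thrift generates from Scribe's IDL; the host, port, and category below are illustrative assumptions, not values from these slides.

    import java.util.Arrays;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class ScribeClientExample {
      public static void main(String[] args) throws Exception {
        // Scribe servers conventionally listen on port 1463 with a
        // framed binary Thrift transport.
        TFramedTransport transport =
            new TFramedTransport(new TSocket("localhost", 1463));
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        scribe.Client client = new scribe.Client(protocol);  // Thrift-generated

        transport.open();
        // The equivalent of scribe_log(category, message) above.
        client.Log(Arrays.asList(new LogEntry("my_category", "hello world\n")));
        transport.close();
      }
    }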
Calligraphus
» What
› Scribe-compatible server written in Java
› Emphasis on modular, testable code base, and performance
» Why?
› Extract simpler design from existing Scribe architecture
› Cleaner integration with Hadoop ecosystem
• HDFS, Zookeeper, HBase, Hive
» History
› In production since November 2010
› Zookeeper integration since March 2011
HDFS : a different use case
» Message hub
› Add concurrent reader support and sync
› Writers + concurrent readers form a pub/sub model
HDFS : add Sync
» Sync
› Implemented in 0.20 (HDFS-200)
• Partial chunks are flushed
• Blocks are persisted
› Provides durability
› Lowers write-to-read latency
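A minimal writer sketch against the 0.20-era Java API: FSDataOutputStream.sync() is the HDFS-200 call described above. The file path and payload are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncWriter {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/category/stream.log"));
        out.write("one log line\n".getBytes("UTF-8"));
        // sync() (HDFS-200) flushes the partial chunk and persists block
        // metadata, so a concurrent reader can see the bytes immediately.
        out.sync();
        out.close();
      }
    }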
HDFS : Concurrent Reads Overview
» Without changes, stock Hadoop 0.20 does not allow access to the block being written
» Need to read the block being written for realtime apps in order to achieve < 10s latency
HDFS : Concurrent Reads Implementation
1. DFSClient asks Namenode for blocks and locations
2. DFSClient asks Datanode for the length of the block being written
3. DFSClient opens the last block
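As an illustration of what these changes enable, the sketch below tails an HDFS file by polling its length and reading any new bytes. It assumes the concurrent-read support above, so that the length and data of the block being written are visible; the path and poll interval are made up for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTail {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/category/stream.log");  // illustrative path
        long offset = 0;
        byte[] buf = new byte[64 * 1024];
        while (true) {
          // With the concurrent-read changes, the reported length covers
          // the block under construction (its length comes from the Datanode).
          long len = fs.getFileStatus(path).getLen();
          if (len > offset) {
            FSDataInputStream in = fs.open(path);
            in.seek(offset);
            int n;
            while ((n = in.read(buf)) > 0) {
              System.out.write(buf, 0, n);
              offset += n;
            }
            in.close();
          }
          Thread.sleep(1000);  // poll like 'tail -f'
        }
      }
    }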
Calligraphus: Log Writer
[Diagram: Scribe categories (Category 1, 2, 3) feed a tier of Calligraphus servers, which persist to HDFS]
How to persist to HDFS?
Calligraphus (Simple)
[Diagram: every Calligraphus server writes every Scribe category to HDFS]
Number of categories × Number of servers = Total number of directories
Calligraphus (Stream Consolidation)
[Diagram: Scribe categories pass through a Calligraphus router tier that consolidates each category onto a single writer before persisting to HDFS; ZooKeeper coordinates routers and writers]
Number of categories = Total number of directories
ZooKeeper: Distributed Map
» Design
› ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
› Canonical ZooKeeper leader elections under each bucket for bucket ownership
› Independent load management: leaders can release tasks
› Reader-side caches
› Frequent sync with policy DB
[Diagram: ZooKeeper tree with Root at the top and task nodes A, B, C, D, each holding buckets 1-5]
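A sketch of the canonical leader-election recipe the slide refers to, using the plain ZooKeeper Java API: each server creates an ephemeral-sequential znode under a bucket path and owns the bucket if its znode has the lowest sequence number. The path layout follows the /root/<category>/<bucket> convention above; node names and error handling are simplified assumptions.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class BucketElection {
      // Try to claim ownership of one category bucket. Assumes the
      // bucket path already exists in ZooKeeper.
      public static boolean tryClaim(ZooKeeper zk, String category, int bucket)
          throws Exception {
        String dir = "/root/" + category + "/" + bucket;
        String me = zk.create(dir + "/n_", new byte[0],
            Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren(dir, false);
        Collections.sort(children);
        // Leader = lowest sequence number; the ephemeral znode vanishes
        // if this server dies, so another one can take over the bucket.
        return me.endsWith(children.get(0));
      }
    }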
Canonical Realtime ptail Application
» Hides the fact that we have many HDFS instances: the user can specify a category and get a single stream
» Checkpointing
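Conceptually, a ptail checkpoint pins a consumer's position in the stream so it can resume after a failure. The record below is an illustrative sketch of what such a position must capture; ptail's actual format is not shown in these slides.

    // Illustrative only: field names are assumptions, not ptail's format.
    public class Checkpoint {
      public final String category;  // logical stream name
      public final String file;      // current HDFS file within the stream
      public final long offset;      // byte offset already consumed

      public Checkpoint(String category, String file, long offset) {
        this.category = category;
        this.file = file;
        this.offset = offset;
      }
    }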
Puma
Puma Overview
» Realtime analytics platform
» Metrics
› count, sum, unique count, average, percentile
» Uses ptail checkpointing for accurate calculations in the case of failure
» Puma nodes are sharded by keys in the input stream
» HBase for persistence
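Puma's actual implementation is not shown here, but the following sketch captures the shape described above: a sharded node aggregates per-key counts in memory and flushes them to HBase. The table and column names are hypothetical, and in the real system the ptail checkpoint would be persisted together with the flush.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PumaLikeCounter {
      private final Map<String, Long> counts = new HashMap<String, Long>();

      // Aggregate one event from the input stream; this node only sees
      // keys in its shard of the key space.
      public void process(String key) {
        Long c = counts.get(key);
        counts.put(key, (c == null ? 0L : c) + 1);
      }

      // Flush aggregated counts to HBase. "puma_metrics", family "m",
      // and qualifier "count" are hypothetical names.
      public void flush() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "puma_metrics");
        for (Map.Entry<String, Long> e : counts.entrySet()) {
          table.incrementColumnValue(Bytes.toBytes(e.getKey()),
              Bytes.toBytes("m"), Bytes.toBytes("count"), e.getValue());
        }
        table.close();
        counts.clear();
      }
    }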
Puma Write Path
Puma Read Path
» Performance
› Elapsed time is typically 200-300 ms for 30-day queries
› 99th percentile, cross-country: < 500 ms for 30-day queries
Future Work
» Puma
› Enhance functionality: add application-level transactions on HBase
› Streaming SQL interface
» Compression