Data Freeway: Scaling Out to Realtime
Authors: Eric Hwang, Sam Rash {ehwang,rash}@fb.com
Speaker: Haiping Wang [email protected]
Agenda
» Data at Facebook
» Realtime Requirements
» Data Freeway System Overview
» Realtime Components
› Calligraphus/Scribe
› HDFS use case and modifications
› Calligraphus: a ZooKeeper use case
› ptail
› Puma
» Future Work
Big Data, Big Applications / Data at Facebook
» Lots of data
› More than 500 million active users
› 50 million users update their statuses at least once each day
› More than 1 billion photos uploaded each month
› More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
› Data rate: over 7 GB/second
» Numerous products can leverage the data
› Revenue related: Ads Targeting
› Product/User Growth related: AYML, PYMK, etc.
› Engineering/Operation related: Automatic Debugging
› Puma: streaming queries
Example: User related Application
» Major challenges: Scalability, Latency
Realtime Requirements
› Scalability: 10-15 GBytes/second
› Reliability: No single point of failure
› Data loss SLA: 0.01%
• Loss due to hardware: at most 1 out of 10,000 machines can lose data
› Delay of less than 10 sec for 99% of data
• Typically we see 2s
› Easy to use: as simple as 'tail -f /var/log/my-log-file'
Data Freeway System Diagram
» Scribe & Calligraphus get data into the system» HDFS at the core» Ptail provides data out» Puma is a emerging streaming analytics platform
Scribe
• Scalable distributed logging framework
• Very easy to use:
• scribe_log(string category, string message)
• Mechanics:
• Built on top of Thrift
• Runs on every machine at Facebook, collecting log data and delivering it to a set of destinations
• Buffer data on local disk if network is down
• History:
• 2007: Started at Facebook
• 2008 Oct: Open-sourced
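To make the API concrete, here is a minimal Java sketch of a Scribe client. scribe.Client and LogEntry are the classes Thrift generates from Scribe's IDL; the host, port, and category below are illustrative assumptions, not values from these slides.

    import java.util.Arrays;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class ScribeClientExample {
      public static void main(String[] args) throws Exception {
        // Scribe servers conventionally listen on port 1463 with a
        // framed binary Thrift transport.
        TFramedTransport transport =
            new TFramedTransport(new TSocket("localhost", 1463));
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        scribe.Client client = new scribe.Client(protocol);  // Thrift-generated

        transport.open();
        // The equivalent of scribe_log(category, message) above.
        client.Log(Arrays.asList(new LogEntry("my_category", "hello world\n")));
        transport.close();
      }
    }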
Calligraphus
» What
› Scribe-compatible server written in Java
› Emphasis on modular, testable code base, and performance
» Why?
› Extract simpler design from existing Scribe architecture
› Cleaner integration with Hadoop ecosystem
• HDFS, Zookeeper, HBase, Hive
» History
› In production since November 2010
› Zookeeper integration since March 2011
HDFS : a different use case
» Message hub
› Add concurrent reader support and sync
› Writers + concurrent readers form a pub/sub model
HDFS : add Sync
» Sync
› Implemented in 0.20 (HDFS-200)
• Partial chunks are flushed
• Blocks are persisted
› Provides durability
› Lowers write-to-read latency
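A minimal writer sketch against the 0.20-era Java API: FSDataOutputStream.sync() is the HDFS-200 call described above. The file path and payload are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncWriter {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/category/stream.log"));
        out.write("one log line\n".getBytes("UTF-8"));
        // sync() (HDFS-200) flushes the partial chunk and persists block
        // metadata, so a concurrent reader can see the bytes immediately.
        out.sync();
        out.close();
      }
    }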
HDFS : Concurrent Reads Overview
» Without changes, stock Hadoop 0.20 does not allow access to the block being written
» Need to read the block being written for realtime apps in order to achieve < 10s latency
HDFS : Concurrent Reads Implementation
1. DFSClient asks Namenode for blocks and locations
2. DFSClient asks Datanode for the length of the block being written
3. DFSClient opens the last block
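As an illustration of what these changes enable, the sketch below tails an HDFS file by polling its length and reading any new bytes. It assumes the concurrent-read support above, so that the length and data of the block being written are visible; the path and poll interval are made up for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTail {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/category/stream.log");  // illustrative path
        long offset = 0;
        byte[] buf = new byte[64 * 1024];
        while (true) {
          // With the concurrent-read changes, the reported length covers
          // the block under construction (its length comes from the Datanode).
          long len = fs.getFileStatus(path).getLen();
          if (len > offset) {
            FSDataInputStream in = fs.open(path);
            in.seek(offset);
            int n;
            while ((n = in.read(buf)) > 0) {
              System.out.write(buf, 0, n);
              offset += n;
            }
            in.close();
          }
          Thread.sleep(1000);  // poll like 'tail -f'
        }
      }
    }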
Calligraphus: Log Writer
[Diagram: Scribe categories (Category 1, 2, 3) feed a tier of Calligraphus servers, which persist to HDFS]
How to persist to HDFS?
Calligraphus (Simple)
[Diagram: every Calligraphus server writes every Scribe category to HDFS]
Number of categories × Number of servers = Total number of directories
Calligraphus (Stream Consolidation)
[Diagram: Scribe categories pass through a Calligraphus router tier that consolidates each category onto a single writer before persisting to HDFS; ZooKeeper coordinates routers and writers]
Number of categories = Total number of directories
ZooKeeper: Distributed Map
» Design
› ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
› Canonical ZooKeeper leader elections under each bucket for bucket ownership
› Independent load management: leaders can release tasks
› Reader-side caches
› Frequent sync with policy DB
[Diagram: ZooKeeper tree with Root at the top and task nodes A, B, C, D, each holding buckets 1-5]
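A sketch of the canonical leader-election recipe the slide refers to, using the plain ZooKeeper Java API: each server creates an ephemeral-sequential znode under a bucket path and owns the bucket if its znode has the lowest sequence number. The path layout follows the /root/<category>/<bucket> convention above; node names and error handling are simplified assumptions.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class BucketElection {
      // Try to claim ownership of one category bucket. Assumes the
      // bucket path already exists in ZooKeeper.
      public static boolean tryClaim(ZooKeeper zk, String category, int bucket)
          throws Exception {
        String dir = "/root/" + category + "/" + bucket;
        String me = zk.create(dir + "/n_", new byte[0],
            Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren(dir, false);
        Collections.sort(children);
        // Leader = lowest sequence number; the ephemeral znode vanishes
        // if this server dies, so another one can take over the bucket.
        return me.endsWith(children.get(0));
      }
    }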
Canonical Realtime ptail Application
» Hides the fact that we have many HDFS instances: the user can specify a category and get a single stream
» Checkpointing
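Conceptually, a ptail checkpoint pins a consumer's position in the stream so it can resume after a failure. The record below is an illustrative sketch of what such a position must capture; ptail's actual format is not shown in these slides.

    // Illustrative only: field names are assumptions, not ptail's format.
    public class Checkpoint {
      public final String category;  // logical stream name
      public final String file;      // current HDFS file within the stream
      public final long offset;      // byte offset already consumed

      public Checkpoint(String category, String file, long offset) {
        this.category = category;
        this.file = file;
        this.offset = offset;
      }
    }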
Puma
Puma Overview
» Realtime analytics platform
» Metrics
› count, sum, unique count, average, percentile
» Uses ptail checkpointing for accurate calculations in the case of failure
» Puma nodes are sharded by keys in the input stream
» HBase for persistence
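Puma's actual implementation is not shown here, but the following sketch captures the shape described above: a sharded node aggregates per-key counts in memory and flushes them to HBase. The table and column names are hypothetical, and in the real system the ptail checkpoint would be persisted together with the flush.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PumaLikeCounter {
      private final Map<String, Long> counts = new HashMap<String, Long>();

      // Aggregate one event from the input stream; this node only sees
      // keys in its shard of the key space.
      public void process(String key) {
        Long c = counts.get(key);
        counts.put(key, (c == null ? 0L : c) + 1);
      }

      // Flush aggregated counts to HBase. "puma_metrics", family "m",
      // and qualifier "count" are hypothetical names.
      public void flush() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "puma_metrics");
        for (Map.Entry<String, Long> e : counts.entrySet()) {
          table.incrementColumnValue(Bytes.toBytes(e.getKey()),
              Bytes.toBytes("m"), Bytes.toBytes("count"), e.getValue());
        }
        table.close();
        counts.clear();
      }
    }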
Puma Write Path
Puma Read Path
» Performance
› Elapsed time is typically 200-300 ms for 30-day queries
› 99th percentile, cross-country: < 500 ms for 30-day queries
Future Work
» Puma
› Enhance functionality: add application-level transactions on HBase
› Streaming SQL interface
» Compression