Apache DistributedLog @ QCon 2017


Building reliable real-time services with Apache DistributedLog

@sijieg

Logs are Everywhere

● DB Storage Engines - WAL
● DB Replication - Binlog, Log shipping
● Distributed Consensus - Replicated log
● Messaging/Pub-Sub - Kafka

Apache DistributedLog

Log Stream

An endless, totally ordered sequence of immutable records

[Figure: a log stream drawn as a row of numbered records, oldest (1) on the left and newest (17) on the right]

Sequence Numbers - DLSN

[Figure: a log stream drawn as a row of numbered records, oldest (1) on the left and newest (17) on the right]

DLSN - System Sequence Number

Sequence Numbers - Transaction ID

[Figure: a log stream drawn as a row of numbered records, oldest (1) on the left and newest (17) on the right]

DLSN - System Sequence Number

Transaction ID - Application Sequence Number

E.g. Offset or Timestamp

Sequence Numbers - Sequence ID

[Figure: a log stream drawn as a row of numbered records, oldest (1) on the left and newest (17) on the right]

DLSN - System Sequence Number

Transaction ID - Application Sequence Number

E.g. Offset or Timestamp

Sequence ID
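The three sequence numbers can be pictured as fields carried by every record. The following is a minimal Python sketch, not the library's API: the `Record` class and its field names are hypothetical, and it assumes that a DLSN is a triple of (log segment sequence number, entry ID, slot ID), that transaction IDs are non-decreasing, and that the sequence ID counts records from the start of the stream.

```python
from dataclasses import dataclass
from typing import NamedTuple


class DLSN(NamedTuple):
    """System-assigned position; tuples compare field by field, giving a total order."""
    segment_seq_no: int
    entry_id: int
    slot_id: int


@dataclass(frozen=True)
class Record:
    dlsn: DLSN          # assigned by the system when the record is written
    txn_id: int         # assigned by the application, e.g. an offset or a timestamp
    sequence_id: int    # assumed here: position of the record within the stream
    payload: bytes


records = [
    Record(DLSN(1, 0, 0), txn_id=1000, sequence_id=0, payload=b"a"),
    Record(DLSN(1, 0, 1), txn_id=1000, sequence_id=1, payload=b"b"),
    Record(DLSN(2, 0, 0), txn_id=1007, sequence_id=2, payload=b"c"),
]

# DLSNs strictly increase along the stream; transaction IDs may repeat.
dlsn_ordered = all(a.dlsn < b.dlsn for a, b in zip(records, records[1:]))
txn_ordered = all(a.txn_id <= b.txn_id for a, b in zip(records, records[1:]))
```

Modeling the DLSN as a tuple keeps comparison trivial: a record in a later segment always sorts after every record in an earlier one.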

Writer & Readers

[Figure: a log stream drawn as a row of numbered records, oldest (1) on the left and newest (17) on the right]

New records added here

Tailing Reads (close to head of stream)

Catching-up Reads (rewind to any position)

Read Parallelism

[Figure: a log stream drawn as a row of numbered records, oldest (1) on the left and newest (17) on the right]

Read from multiple positions in parallel
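The reader modes on these slides (tailing, catching up, and parallel reads from several positions) can be illustrated with a toy in-memory stream. This is only a sketch under simplifying assumptions: `LogStream`, `read_from`, and `tail_position` are hypothetical names, and positions are plain list indexes rather than DLSNs.

```python
from concurrent.futures import ThreadPoolExecutor


class LogStream:
    """Append-only in-memory list standing in for a log stream."""

    def __init__(self):
        self._records = []

    def append(self, payload):
        self._records.append(payload)
        return len(self._records) - 1        # position of the new record

    def read_from(self, position, max_records=10):
        """Return up to max_records starting at position, plus the next position."""
        batch = self._records[position:position + max_records]
        return batch, position + len(batch)

    def tail_position(self):
        return len(self._records)


log = LogStream()
for i in range(5):
    log.append(f"record-{i}")

tail_pos = log.tail_position()               # tailing reader waits at the head of the stream
log.append("record-5")
tailing, tail_pos = log.read_from(tail_pos)  # sees only the newly added record
catch_up, _ = log.read_from(0)               # catch-up reader rewinds to the oldest record

# Read parallelism: independent readers start at different positions.
with ThreadPoolExecutor(max_workers=2) as pool:
    chunks = list(pool.map(lambda start: log.read_from(start, 3)[0], [0, 3]))
```

Because records are immutable, readers never coordinate with each other; each one only tracks its own position.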

Log Segments

[Figure: the log stream (records 1 to 17, oldest to newest) divided into consecutive segments: Log Segment X, Log Segment X+1, Log Segment X+2]
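Splitting a stream into segments can be sketched as follows. The sketch assumes a segment is rolled after a fixed number of records, which is a simplification chosen for illustration; `SegmentedLog` and its methods are hypothetical names.

```python
class SegmentedLog:
    """A stream stored as a list of segments; only the last segment accepts writes."""

    def __init__(self, max_records_per_segment=3):
        self.max_records = max_records_per_segment
        self.segments = [[]]                 # segment X, X+1, X+2, ...

    def append(self, record):
        if len(self.segments[-1]) >= self.max_records:
            self.segments.append([])         # roll: stop writing the old segment, start a new one
        self.segments[-1].append(record)

    def truncate_before(self, segment_index):
        """Drop whole segments older than segment_index."""
        self.segments = self.segments[segment_index:]


segmented = SegmentedLog()
for record in range(1, 8):
    segmented.append(record)
layout_before = [list(s) for s in segmented.segments]

segmented.truncate_before(1)                 # discard the oldest segment in one step
layout_after = [list(s) for s in segmented.segments]
```

Working in whole segments means old data can be dropped or moved without touching the segment that is still being written.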

Log Segment Store

[Figure: the log stream's segments (Log Segment X, Log Segment X+1, Log Segment X+2) placed in the log segment store]

Apache BookKeeper

Log Stream Metadata

[Figure: a writer and three readers attached to a log stream (records 1 to 17, oldest to newest); arrows labeled "Updates" and "Notifications" connect them to the stream metadata]

Stream Metadata
- List of segments
- Transaction Id Index
- Truncation point
- ...

Namespace

[Figure: a writer and three readers attached to a log stream (records 1 to 17, oldest to newest); an arrow labeled "Lookup" connects them to the namespace]

Namespace
- /manhattan/stream-x
- ...
- /ads/stream_xxx
- /ads/stream_yyy

Architecture

Metadata Store

Log Segment Store (BK), Cold Storage (HDFS)
- Segments

Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Raw Streams

Write Proxy, Read Proxy
- Ownership Tracking
- Batching, Compression
- Record Cache
- Rate Limiting, Quota
- Serving

Data Flow

[Figure: Write Client -> Write Proxy -> three Bookies -> Read Proxy -> three Read Clients]

1. Write records
2. Transmit buffer
3. Flush - write a batched entry to bookies
4. Acknowledge
5. Commit - write control record
6. Long poll read
7. Speculative read
8. Cache records
9. Long poll read
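Steps 1 to 5 of the write path can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the proxy's implementation: `Bookie`, `WriteProxy`, and `readable_records` are hypothetical names, the flush trigger is a fixed batch size, every bookie receives every entry, and the control record is assumed to mark everything before it as safe to read.

```python
class Bookie:
    def __init__(self):
        self.entries = []

    def add_entry(self, entry):
        self.entries.append(entry)
        return True                           # acknowledge the write


class WriteProxy:
    def __init__(self, bookies, batch_size=3):
        self.bookies = bookies
        self.batch_size = batch_size
        self.buffer = []                      # step 2: transmit buffer
        self.acked = []

    def write(self, record):                  # step 1: client writes a record
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):                          # step 3: one batched entry to the bookies
        if not self.buffer:
            return
        batch = ("data", tuple(self.buffer))
        if all(b.add_entry(batch) for b in self.bookies):
            self.acked.extend(self.buffer)    # step 4: acknowledge to the client
            self.buffer = []
            self.commit()

    def commit(self):                         # step 5: write a control record
        for b in self.bookies:
            b.add_entry(("control", len(self.acked)))


def readable_records(bookie):
    """Records a reader may see: data entries that precede the last control record."""
    last_control = max(
        (i for i, (kind, _) in enumerate(bookie.entries) if kind == "control"),
        default=-1,
    )
    visible = bookie.entries[:max(last_control, 0)]
    return [r for kind, body in visible if kind == "data" for r in body]


bookies = [Bookie(), Bookie(), Bookie()]
proxy = WriteProxy(bookies)
for name in ["r1", "r2", "r3", "r4"]:
    proxy.write(name)
```

After the four writes, the first three records have been flushed, acknowledged, and committed, while `r4` is still waiting in the transmit buffer.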

Consensus

Consensus - Primary Leader Approach

Consensus - Log Replication

Consensus - Safety Guarantees

● Election Safety - CAS operation on metadata store
  ○ Log segment sequence numbers increase monotonically
  ○ A log segment sequence number is guaranteed to be handed over to a writer only once
● Log Segment Append-Only
  ○ A writer can only append entries to the log segment that is allocated to it
● Fencing - Termination mechanism of a log segment
  ○ No entries can be appended to a log segment once it is fenced
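The three safety properties above can be sketched with a versioned metadata cell and a fence flag. This is an illustrative model only: `MetadataStore`, `LogSegment`, and `allocate_segment` are hypothetical names, and the compare-and-set is simulated in memory rather than performed against a real metadata store.

```python
class MetadataStore:
    """Versioned cell standing in for the metadata store."""

    def __init__(self):
        self.version = 0
        self.last_segment_seq_no = 0

    def compare_and_set(self, expected_version, new_seq_no):
        if expected_version != self.version:
            return False                      # another writer won the race
        self.version += 1
        self.last_segment_seq_no = new_seq_no
        return True


class LogSegment:
    def __init__(self, seq_no, owner):
        self.seq_no, self.owner = seq_no, owner
        self.entries, self.fenced = [], False

    def append(self, writer, entry):
        if self.fenced:
            raise PermissionError("segment is fenced")
        if writer != self.owner:
            raise PermissionError("segment is allocated to another writer")
        self.entries.append(entry)


def allocate_segment(store, writer, seen_version, seen_seq_no):
    """Try to claim the next segment sequence number; None means the CAS lost."""
    new_seq_no = seen_seq_no + 1
    if store.compare_and_set(seen_version, new_seq_no):
        return LogSegment(new_seq_no, writer)
    return None


store = MetadataStore()
version, seq_no = store.version, store.last_segment_seq_no   # both writers read the same state
seg_a = allocate_segment(store, "writer-a", version, seq_no)  # wins the CAS
seg_b = allocate_segment(store, "writer-b", version, seq_no)  # loses: the version moved on
seg_a.append("writer-a", "e1")
seg_a.fenced = True                                           # the segment is terminated
```

Because the second writer's CAS fails, sequence number 1 is handed out exactly once, and once the segment is fenced even its owner can no longer append to it.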

Use Cases

Architecture

Metadata Store

Log Segment Store (BK), Cold Storage (HDFS)
- Segments

Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Raw Streams

Write Proxy, Read Proxy
- Ownership Tracking
- Batching, Compression
- Record Cache
- Rate Limiting, Quota
- Serving

Applications - Different Consumer models
- DBs - e.g., Twitter’s Manhattan
- Deferred RPC (queuing)
- Self-serve Pub/Sub
- Stream Computing
- Cross DC Replication

Database

Stronger Consistency

Stronger Consistency in Manhattan

[Figure: three MH Coordinators, a log stream (records 1 to 17, oldest to newest), and three MH Replicas, with steps numbered 1 to 3]

Self-Serve Pub/Sub

Message Delivery

Partitioned Pub/Sub

[Figure: a topic made of three partitions, each drawn as a log stream of records numbered 1 to 22]

New messages appended here

Reads from any position
- last position stored in offset store
- rewind to any position
- rewind by time (e.g. 15 mins ago)
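Rewinding by time and remembering the last position can be sketched with a sorted list of timestamps and a small offset store. The sketch assumes each message carries a non-decreasing timestamp (for example as its transaction ID); `offset_store`, `rewind_by_time`, and `read` are hypothetical names.

```python
import bisect

# (timestamp_seconds, message) pairs in append order; timestamps are non-decreasing.
partition = [(100, "m1"), (160, "m2"), (220, "m3"), (280, "m4"), (340, "m5")]
timestamps = [ts for ts, _ in partition]

offset_store = {}    # hypothetical store: subscriber name -> next position to read


def rewind_by_time(subscriber, now, seconds_ago):
    """Move the subscriber to the first message at or after (now - seconds_ago)."""
    position = bisect.bisect_left(timestamps, now - seconds_ago)
    offset_store[subscriber] = position
    return position


def read(subscriber, max_messages=10):
    """Read from the stored position and remember where the subscriber stopped."""
    start = offset_store.get(subscriber, 0)
    batch = partition[start:start + max_messages]
    offset_store[subscriber] = start + len(batch)
    return [msg for _, msg in batch]


position = rewind_by_time("ads-consumer", now=340, seconds_ago=120)
replayed = read("ads-consumer")
```

Binary search over the timestamps plays the role that a transaction ID index would play in a real deployment: it turns "120 seconds ago" into a position without scanning the partition.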

Deferred RPC

Reliable Queuing

Reliable RPC System

[Figure: a Web Server, an RPC Queue holding a row of queued requests, two RPC Workers, and Services A, B, and C, with steps numbered 1 to 4]
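One way to read the figure is that the web server journals each request to a durable queue and workers replay the queue against the backing services. The sketch below assumes exactly that, plus a worker that advances its position only after a call succeeds; all names (`rpc_queue`, `enqueue`, `run_worker`) are hypothetical, and the list stands in for a log stream.

```python
rpc_queue = []             # stands in for a durable log stream
committed_position = 0     # next request the worker has not yet completed


def enqueue(service, payload):
    """Web server side: journal the request and return immediately."""
    rpc_queue.append((service, payload))


def run_worker(services):
    """Worker side: replay requests in order, advancing only after success."""
    global committed_position
    while committed_position < len(rpc_queue):
        service, payload = rpc_queue[committed_position]
        try:
            services[service](payload)
        except Exception:
            return False   # stop; the same request is retried on the next run
        committed_position += 1
    return True


calls = []
attempts = {"count": 0}


def flaky_service_b(payload):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("service B unavailable")
    calls.append(("B", payload))


services = {"A": lambda payload: calls.append(("A", payload)), "B": flaky_service_b}
enqueue("A", "x")
enqueue("B", "y")
first_run = run_worker(services)     # stops at the failing request
second_run = run_worker(services)    # retries it and completes
```

Since the request stays in the queue until it succeeds, a temporary outage of Service B delays the call instead of losing it.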

Scale at Twitter

Performance - Basic (GCP)

● Disk & Network Bound
● 1 Journal Disk + 5 Ledger Disks
● Each disk can write/read at ~220 MB/second
● 6 log streams, 1 write proxy + 3 bookies
● 1 writer + 1 tailing reader => 2 million records/second
● 3 catch-up readers => 7.5 million records/second
● End-to-end latency: within 30 ms when the network is around 30% utilized

Performance - Effect of Record Size

Applications at Twitter

● Manhattan Key/Value Store - Stronger Consistency
● Durable Deferred RPC - Journal
● Real-time search indexing - Change propagation
● Self-serve Pub/Sub - Message Delivery, Ads Pipeline
● Stream Computing
  ○ Source & Sink
  ○ Stateful Processing in Heron (coming soon)
● Reliable cross datacenter replication
● ...

Scale at Twitter

● O(1) trillion records per day, O(10) petabytes per day
● O(10) thousand streams, O(1) million live log segments
● O(10^2) bookies, O(10^3) proxies
● Record size from 100 bytes to 20 KB and even larger
● Data is kept from hours to days, even up to a year
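To get a feel for these orders of magnitude, the daily figures can be converted to per-second rates. The arithmetic below is only illustrative: it takes "O(1) trillion" as exactly 1 trillion records and "O(10) petabytes" as exactly 10 decimal petabytes, which the slide does not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = 1e12     # assumption: O(1) trillion taken as exactly 1 trillion
bytes_per_day = 10e15      # assumption: O(10) petabytes taken as exactly 10 PB

records_per_second = records_per_day / SECONDS_PER_DAY
bytes_per_second = bytes_per_day / SECONDS_PER_DAY
average_record_bytes = bytes_per_day / records_per_day

print(f"{records_per_second / 1e6:.1f} million records/s")
print(f"{bytes_per_second / 1e9:.1f} GB/s")
print(f"{average_record_bytes / 1e3:.0f} KB average record")
```

Under these assumptions the cluster absorbs roughly 11.6 million records and about 116 GB every second, with an average record of about 10 KB, which falls inside the 100 bytes to 20 KB range quoted above.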

Future

Not Just Messaging

● Stream - Events between services
  ○ Persistent
  ○ Rewindable
  ○ Replayable
  ○ Time independent
● Unification of Messaging and Storage

Apache DistributedLog (incubating)

● Open sourced on 05/09/2016.
● Landed at Apache Incubator on 06/25/2016.
● Website
  ○ http://distributedlog.io/
  ○ http://incubator.apache.org/projects/distributedlog.html
● Code - https://github.com/apache/incubator-distributedlog

Apache DistributedLog (incubating)

● Mail List - dev@distributedlog.incubator.apache.org
● Jira - https://issues.apache.org/jira/browse/DL
● Project Ideas - https://cwiki.apache.org/confluence/display/DL/Project+Ideas
● Paper: “DistributedLog: A high performance replicated log service” (ICDE 2017)

Q/A

● Twitter: @sijieg
● Email: sijie@apache.org

Appendix

● Kafka vs DistributedLog

Kafka vs DL - Overall

Kafka vs DL - Data Segmentation

Kafka vs DL - Data Retention

● Kafka
  ○ Time based Retention
  ○ Log compaction by keys
● DL
  ○ Time based Retention (messaging)
  ○ Explicit truncation (database, replicated state machines)

Kafka vs DL - Cluster Expansion

● Kafka - Partition Rebalance
  ○ Adding new brokers
  ○ Partitions outgrow brokers’ capacity
  ○ Adding new partitions
● DL
  ○ New log segments will automatically be allocated to new storage nodes
  ○ Scaling proxies (cpu, memory) independent of scaling storage

Kafka vs DL - Writer

● Kafka
  ○ Multiple-Writers Semantic via Brokers
● DL
  ○ Multiple-Writers Semantic via Write Proxies (messaging)
  ○ Single-Writer Semantic using Core Library (database, replicated state machines)
    ■ Fencing, Exclusive Writer

Kafka vs DL - Reader

● Kafka
  ○ Both writes and reads are served by the leader brokers
  ○ Polling
● DL
  ○ Reads from any storage replicas
  ○ Long poll + Speculative Reads

Kafka vs DL - Replication Scheme

● Kafka
  ○ ISR Replication
  ○ Follower brokers catch up with the leader broker
● DL
  ○ Quorum-Vote Replication
  ○ Ack Quorum is adjustable
  ○ Replication Repair
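The effect of an adjustable ack quorum can be shown with a toy quorum write. This is a sketch, not BookKeeper's protocol: `quorum_write` is a hypothetical function, replicas are plain dictionaries, and a replica that is down simply never acknowledges.

```python
def quorum_write(entry, replicas, ack_quorum):
    """Send entry to every replica; succeed once ack_quorum replicas have stored it."""
    acks = 0
    for replica in replicas:
        if replica["up"]:
            replica["entries"].append(entry)
            acks += 1
    return acks >= ack_quorum


replicas = [
    {"up": True, "entries": []},
    {"up": True, "entries": []},
    {"up": False, "entries": []},    # one storage node is unavailable
]

ok_with_2 = quorum_write("e1", replicas, ack_quorum=2)   # two of three acknowledged
ok_with_3 = quorum_write("e2", replicas, ack_quorum=3)   # not enough acknowledgements
```

With an ack quorum of 2 the write completes despite the failed node, at the cost of one replica missing the entry, which is the gap that replication repair has to close later. Note that the failed write still left `e2` on the two healthy replicas; the writer just cannot treat it as committed.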

Kafka vs DL - Storage/Durability

● Kafka
  ○ File (set of files) per partition
  ○ Only write to filesystem page cache
● DL (BookKeeper)
  ○ Interleaved Storage
  ○ All writes are persisted to disk via explicit fsync before they are acknowledged
  ○ Physical I/O Isolation
