Deploying Kafka at Dropbox, Mark Smith, Sean Fellows

Deploying Kafka at Dropbox

Alternately: how to handle 10,000,000 QPS in one cluster (but don't)

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Your Speakers

• Mark Smith <zorkian@dropbox.com>

formerly of Google, Bump, StumbleUpon, etc

likes small airplanes and not getting paged

• Sean Fellows <fellows@dropbox.com>

formerly of Google

likes corgis and distributed systems

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Dropbox

• Over 500 million signups

• Exabyte scale storage system

• Multiple hardware locations + AWS

Log Events

• Wide distribution (1,000 categories)

• Several do >1M QPS each + long tail

• About 200TB/day (raw)

• Payloads range from empty to 15MB JSON blobs

Current System

• Existing system based on Scribe + HDFS

• Aggregate to single destination for analytics

• Powers Hive and standard map-reduce type analytics

Want: real-time stream processing!

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Initial Design

• One big cluster

• 20 brokers: 96GB RAM, 16x2TB disk, JBOD config

• ZK ensemble run separately (5 members)

• Kafka 0.8.2 from GitHub

• LinkedIn configuration recommendations

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Unexpected Catastrophes

• Disks failing or reaching 100% full

• Repair is manual, won't expire unless caught up

• Crash looping, controller load

• Simultaneous restarts

• Even with graceful shutdowns, recovery is sometimes very bad (even on 0.9!)

• Rebalancing is dangerous

• Saturates disks; partitions fall out of ISRs, go offline, etc.

System Errors

• Controller issues

• Sometimes goes AWOL with e.g. big rebalances

• Can have multiple controllers (during serial operations)

• Cascading OOMs

• Too many connections

Lack of Tooling

• Usually left to the reader

• Few best practices

• But we love Kafka Manager

• More to come later!

Newer Clients

• State of Go/Python clients

• Bad behavior at scale

• Laserbeam, retries, backoff (sketch below)

• Too many connections == OOM

• Good clients take time
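
Concretely, the client discipline above is mostly about reusing one connection and backing off on failure. A minimal Python sketch of that behavior, assuming the kafka-python client (hostnames, timeouts, and retry limits are placeholders, not production values):

    import random
    import time

    from kafka import KafkaProducer  # kafka-python; any client with a similar API works
    from kafka.errors import KafkaError

    # One long-lived producer per process: creating a client per request is how
    # "too many connections == OOM" happens on the broker side.
    producer = KafkaProducer(bootstrap_servers=["kafka01:9092", "kafka02:9092"])

    def send_with_backoff(topic, payload, max_attempts=5):
        """Retry with exponential backoff plus jitter instead of laser-beaming brokers."""
        for attempt in range(max_attempts):
            try:
                producer.send(topic, payload).get(timeout=10)
                return True
            except KafkaError:
                # 0.1s, 0.2s, 0.4s, ... plus jitter so retries don't synchronize.
                time.sleep(0.1 * 2 ** attempt + random.uniform(0, 0.1))
        return False  # give up; the caller decides whether to drop or spill to disk

The important parts are the single shared producer and the bounded, jittered retries; a naive retry loop at millions of QPS amplifies every broker hiccup.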

Bad Configs

• Many, many tunables -- lots of rope

• Unclean leader election (example settings below)

• Preferred leader automation

• Disk threads (thanks Gwen!)

• Little modern documentation on running at scale

• Todd Palino helped us out early on, though, so thank you!
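
For reference, a sketch of the broker settings those bullets refer to, with the safety-oriented choices written out as a small table a config checker might enforce (values are illustrative, not production config; check your Kafka version's defaults):

    # Broker overrides for the tunables called out above (illustrative values).
    SAFE_BROKER_OVERRIDES = {
        # Never hand leadership to an out-of-sync replica: unclean election
        # trades acknowledged data for availability.
        "unclean.leader.election.enable": "false",
        # Automatic preferred-leader rebalancing can pile leadership moves onto
        # a cluster that is already struggling; trigger it deliberately instead.
        "auto.leader.rebalance.enable": "false",
        # More recovery threads per data directory so a broker with thousands
        # of log segments doesn't take hours to restart (the "disk threads" point).
        "num.recovery.threads.per.data.dir": "8",
    }

    def config_drift(running_config):
        """Return any settings that differ from the overrides above."""
        return {k: v for k, v in SAFE_BROKER_OVERRIDES.items()
                if running_config.get(k) != v}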

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Hardware

• Hardware RAID 10

• ~25TB usable/box (spinning rust)

• During broker replacement: p99 commit latency went from 200ms down to 10ms!

• Failure tolerance, full disk protection

• Canary cluster

Monitoring

• MPS vs QPS (metadata reqs!)

• Bad Stuff graph

• Disk utilization/latency

• Heap usage

• Number of controllers (check sketched below)
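
The controller count is the easiest of these to turn into a hard alert: the cluster-wide sum of ActiveControllerCount must be exactly 1. A small sketch, assuming brokers expose JMX over HTTP via a Jolokia agent on port 8778 (an assumption about the setup; the invariant is the point):

    import requests

    # Gauge exported by every broker; 1 on the controller, 0 everywhere else.
    MBEAN = "kafka.controller:type=KafkaController,name=ActiveControllerCount"

    def active_controllers(brokers, port=8778):
        """Sum ActiveControllerCount across all brokers via Jolokia."""
        total = 0
        for host in brokers:
            url = "http://%s:%d/jolokia/read/%s/Value" % (host, port, MBEAN)
            total += requests.get(url, timeout=5).json()["value"]
        return total

    if __name__ == "__main__":
        count = active_controllers(["kafka01", "kafka02"])  # placeholder hostnames
        if count != 1:
            print("ALERT: %d active controllers (want exactly 1)" % count)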

Tooling

• Rolling restarter (health checks! sketched below)

• Rate limited partition rebalancer (MPS)

• Config verifier/enforcer

• Coordinated consumption (pre-0.9)

• Auditing framework
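
A compressed sketch of the health-check loop at the heart of a rolling restarter, using the stock kafka-topics.sh tool to gate each restart on zero under-replicated partitions (the real tool does considerably more; the ssh/systemd details are placeholders):

    import subprocess
    import time

    def under_replicated(zookeeper):
        """Count under-replicated partitions with the stock kafka-topics.sh tool."""
        out = subprocess.run(
            ["kafka-topics.sh", "--describe", "--under-replicated-partitions",
             "--zookeeper", zookeeper],
            capture_output=True, text=True, check=True).stdout
        return sum(1 for line in out.splitlines() if line.strip())

    def rolling_restart(brokers, zookeeper, settle=60):
        """Restart brokers one at a time, only proceeding when the cluster is healthy."""
        for host in brokers:
            # Health gate: never touch the next broker while any partition is
            # under-replicated, or one restart can cascade into offline partitions.
            while under_replicated(zookeeper) > 0:
                time.sleep(30)
            subprocess.run(["ssh", host, "sudo systemctl restart kafka"], check=True)
            time.sleep(settle)  # let the broker rejoin its ISRs before moving on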

Customer Culture

• Topics : organization :: partitions : scale

• Do not hash to partitions (see producer sketch below)

• No ordering requirements

• Namespaces and ownership are required
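
In practice "do not hash to partitions" just means producing without a key, so partition counts can grow without breaking anyone. A tiny kafka-python sketch (topic name, host, and payload are made up):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=["kafka01:9092"])  # placeholder host

    # No key: the client spreads records across whatever partitions exist today,
    # so partitions can be added later without changing any producer's behavior.
    producer.send("analytics.page_views", b'{"user": 123, "path": "/home"}')

    # Keyed sends pin each key to one partition -- that creates hot partitions
    # and turns every up-partitioning into a visible ordering change.
    # producer.send("analytics.page_views", key=b"user-123", value=b"...")

    producer.flush()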

Success!

• Kafka goes fast (18M+ MPS on 20 brokers)

• Multiple parallel consumption

• Low latency (at high produce rates)

• 0.9 is leaps ahead of 0.8.2 (upgrade!)

• Supportable by a small team (at our scale)

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

The Future

• Big is fun but has problems

• Open source our tooling

• Moving towards replication

• Automatic up-partitioning and rebalancing

• Expanding auditing to clients

• Low volume latencies

Deploying Kafka at Dropbox

• Mark Smith <zorkian@dropbox.com>

• Sean Fellows <fellows@dropbox.com>

We would love to talk with other people who are running Kafka at similar scales. Email us!

And... questions! (If we have time.)
