Deploying Kafka at Dropbox, Mark Smith, Sean Fellows

Deploying Kafka at Dropbox

Alternately: how to handle 10,000,000 QPS in one cluster (but don't)

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Your Speakers

• Mark Smith <zorkian@dropbox.com>

formerly of Google, Bump, StumbleUpon, etc

likes small airplanes and not getting paged

• Sean Fellows <fellows@dropbox.com>

formerly of Google

likes corgis and distributed systems

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Dropbox

• Over 500 million signups

• Exabyte scale storage system

• Multiple hardware locations + AWS

Log Events

• Wide distribution (1,000 categories)

• Several do >1M QPS each + long tail

• About 200TB/day (raw)

• Payloads range from empty to 15MB JSON blobs

Current System

• Existing system based on Scribe + HDFS

• Aggregate to single destination for analytics

• Powers Hive and standard map-reduce type analytics

Want: real-time stream processing!

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Initial Design

• One big cluster

• 20 brokers: 96GB RAM, 16x2TB disk, JBOD config

• ZK ensemble run separately (5 members)

• Kafka 0.8.2 from GitHub

• LinkedIn configuration recommendations

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Unexpected Catastrophes

• Disks failing or reaching 100% full

• Repair is manual, won't expire unless caught up

• Crash looping, controller load

• Simultaneous restarts

• Even with graceful shutdowns, recovery is sometimes very bad (even on 0.9!)

• Rebalancing is dangerous

• Saturates disks; partitions fall out of ISRs, go offline, etc.

System Errors

• Controller issues

• Sometimes goes AWOL with e.g. big rebalances

• Can have multiple controllers (during serial operations)

• Cascading OOMs

• Too many connections

Lack of Tooling

• Usually left to the reader

• Few best practices

• But we love Kafka Manager

• More to come later!

Newer Clients

• State of Go/Python clients

• Bad behavior at scale

• Laserbeam, retries, backoff (sketch below)

• Too many connections == OOM

• Good clients take time
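
Concretely, the client discipline above is mostly about reusing one connection and backing off on failure. A minimal Python sketch of that behavior, assuming the kafka-python client (hostnames, timeouts, and retry limits are placeholders, not production values):

    import random
    import time

    from kafka import KafkaProducer  # kafka-python; any client with a similar API works
    from kafka.errors import KafkaError

    # One long-lived producer per process: creating a client per request is how
    # "too many connections == OOM" happens on the broker side.
    producer = KafkaProducer(bootstrap_servers=["kafka01:9092", "kafka02:9092"])

    def send_with_backoff(topic, payload, max_attempts=5):
        """Retry with exponential backoff plus jitter instead of laser-beaming brokers."""
        for attempt in range(max_attempts):
            try:
                producer.send(topic, payload).get(timeout=10)
                return True
            except KafkaError:
                # 0.1s, 0.2s, 0.4s, ... plus jitter so retries don't synchronize.
                time.sleep(0.1 * 2 ** attempt + random.uniform(0, 0.1))
        return False  # give up; the caller decides whether to drop or spill to disk

The important parts are the single shared producer and the bounded, jittered retries; a naive retry loop at millions of QPS amplifies every broker hiccup.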

Bad Configs

• Many, many tunables -- lots of rope

• Unclean leader election (example settings below)

• Preferred leader automation

• Disk threads (thanks Gwen!)

• Little modern documentation on running at scale

• Todd Palino helped us out early on, though, so thank you!
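
For reference, a sketch of the broker settings those bullets refer to, with the safety-oriented choices written out as a small table a config checker might enforce (values are illustrative, not production config; check your Kafka version's defaults):

    # Broker overrides for the tunables called out above (illustrative values).
    SAFE_BROKER_OVERRIDES = {
        # Never hand leadership to an out-of-sync replica: unclean election
        # trades acknowledged data for availability.
        "unclean.leader.election.enable": "false",
        # Automatic preferred-leader rebalancing can pile leadership moves onto
        # a cluster that is already struggling; trigger it deliberately instead.
        "auto.leader.rebalance.enable": "false",
        # More recovery threads per data directory so a broker with thousands
        # of log segments doesn't take hours to restart (the "disk threads" point).
        "num.recovery.threads.per.data.dir": "8",
    }

    def config_drift(running_config):
        """Return any settings that differ from the overrides above."""
        return {k: v for k, v in SAFE_BROKER_OVERRIDES.items()
                if running_config.get(k) != v}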

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Hardware

• Hardware RAID 10

• ~25TB usable/box (spinning rust)

• During broker replacement: p99 commit latency went from 200ms down to 10ms!

• Failure tolerance, full disk protection

• Canary cluster

Monitoring

• MPS vs QPS (metadata reqs!)

• Bad Stuff graph

• Disk utilization/latency

• Heap usage

• Number of controllers (check sketched below)
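
The controller count is the easiest of these to turn into a hard alert: the cluster-wide sum of ActiveControllerCount must be exactly 1. A small sketch, assuming brokers expose JMX over HTTP via a Jolokia agent on port 8778 (an assumption about the setup; the invariant is the point):

    import requests

    # Gauge exported by every broker; 1 on the controller, 0 everywhere else.
    MBEAN = "kafka.controller:type=KafkaController,name=ActiveControllerCount"

    def active_controllers(brokers, port=8778):
        """Sum ActiveControllerCount across all brokers via Jolokia."""
        total = 0
        for host in brokers:
            url = "http://%s:%d/jolokia/read/%s/Value" % (host, port, MBEAN)
            total += requests.get(url, timeout=5).json()["value"]
        return total

    if __name__ == "__main__":
        count = active_controllers(["kafka01", "kafka02"])  # placeholder hostnames
        if count != 1:
            print("ALERT: %d active controllers (want exactly 1)" % count)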

Tooling

• Rolling restarter (health checks! sketched below)

• Rate limited partition rebalancer (MPS)

• Config verifier/enforcer

• Coordinated consumption (pre-0.9)

• Auditing framework
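
A compressed sketch of the health-check loop at the heart of a rolling restarter, using the stock kafka-topics.sh tool to gate each restart on zero under-replicated partitions (the real tool does considerably more; the ssh/systemd details are placeholders):

    import subprocess
    import time

    def under_replicated(zookeeper):
        """Count under-replicated partitions with the stock kafka-topics.sh tool."""
        out = subprocess.run(
            ["kafka-topics.sh", "--describe", "--under-replicated-partitions",
             "--zookeeper", zookeeper],
            capture_output=True, text=True, check=True).stdout
        return sum(1 for line in out.splitlines() if line.strip())

    def rolling_restart(brokers, zookeeper, settle=60):
        """Restart brokers one at a time, only proceeding when the cluster is healthy."""
        for host in brokers:
            # Health gate: never touch the next broker while any partition is
            # under-replicated, or one restart can cascade into offline partitions.
            while under_replicated(zookeeper) > 0:
                time.sleep(30)
            subprocess.run(["ssh", host, "sudo systemctl restart kafka"], check=True)
            time.sleep(settle)  # let the broker rejoin its ISRs before moving on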

Customer Culture

• Topics : organization :: partitions : scale

• Do not hash to partitions (see producer sketch below)

• No ordering requirements

• Namespaces and ownership are required
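
In practice "do not hash to partitions" just means producing without a key, so partition counts can grow without breaking anyone. A tiny kafka-python sketch (topic name, host, and payload are made up):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=["kafka01:9092"])  # placeholder host

    # No key: the client spreads records across whatever partitions exist today,
    # so partitions can be added later without changing any producer's behavior.
    producer.send("analytics.page_views", b'{"user": 123, "path": "/home"}')

    # Keyed sends pin each key to one partition -- that creates hot partitions
    # and turns every up-partitioning into a visible ordering change.
    # producer.send("analytics.page_views", key=b"user-123", value=b"...")

    producer.flush()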

Success!

• Kafka goes fast (18M+ MPS on 20 brokers)

• Multiple parallel consumption

• Low latency (at high produce rates)

• 0.9 is leaps ahead of 0.8.2 (upgrade!)

• Supportable by a small team (at our scale)

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

The Future

• Big is fun but has problems

• Open source our tooling

• Moving towards replication

• Automatic up-partitioning and rebalancing

• Expanding auditing to clients

• Low volume latencies

Deploying Kafka at Dropbox

• Mark Smith <zorkian@dropbox.com>

• Sean Fellows <fellows@dropbox.com>

We would love to talk with other people who are running Kafka at similar scales. Email us!

And... questions! (If we have time.)
