Apache Kafka

CMSC 491 Hadoop-Based Distributed Computing

Spring 2016 Adam Shook

Overview

•  Kafka is a “publish-subscribe messaging rethought as a distributed commit log”
•  Fast
•  Scalable
•  Durable
•  Distributed

Kafka adoption and use cases

•  LinkedIn: activity streams, operational metrics, data bus
    –  400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
•  Netflix: real-time monitoring and event processing
•  Twitter: as part of their Storm real-time data pipelines
•  Spotify: log delivery (from 4h down to 10s), Hadoop
•  Loggly: log collection and processing
•  Mozilla: telemetry data
•  Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …


How fast is Kafka?

•  “Up to 2 million writes/sec on 3 cheap machines”
    –  Using 3 producers on 3 different machines, 3x async replication
        •  Only 1 producer/machine because NIC already saturated
•  Sustained throughput as stored data grows
    –  Slightly different test config than 2M writes/sec above.


Why is Kafka so fast?

•  Fast writes:
    –  While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
•  Fast reads:
    –  Very efficient to transfer data from page cache to a network socket
    –  Linux: sendfile() system call
•  Combination of the two = fast Kafka!
    –  Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.
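The read path above can be sketched in Python, assuming a Unix-like system. `socket.sendfile()` is the standard-library call that uses `os.sendfile()` (the `sendfile()` system call) where available and falls back to plain `send()` otherwise. The temporary file standing in for a log segment, the socket pair standing in for a broker-to-consumer connection, and the names `serve_segment` and `demo` are illustrative assumptions, not Kafka code.

```python
import os
import socket
import tempfile


def serve_segment(path, sock):
    """Send the file at `path` over `sock` and return the number of bytes sent."""
    with open(path, "rb") as segment:
        # The kernel can copy from the page cache to the socket
        # without routing the bytes through a user-space buffer.
        return sock.sendfile(segment)


def demo():
    """Write a small 'segment' file, send it over a socket pair, and read it back."""
    payload = b"msg-0\nmsg-1\nmsg-2\n"
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    broker_side, consumer_side = socket.socketpair()
    try:
        with broker_side, consumer_side:
            sent = serve_segment(path, broker_side)
            broker_side.shutdown(socket.SHUT_WR)  # signal end of stream
            received = b""
            while True:
                chunk = consumer_side.recv(4096)
                if not chunk:
                    break
                received += chunk
    finally:
        os.unlink(path)
    return sent, payload, received
```

Calling `demo()` returns the byte count reported by the sender together with the original and received bytes, so the two can be compared.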


A first look

•  The who is who
    –  Producers write data to brokers.
    –  Consumers read data from brokers.
    –  All this is distributed.
•  The data
    –  Data is stored in topics.
    –  Topics are split into partitions, which are replicated.

Topics

•  Topic: feed name to which messages are published
    –  Example: “zerg.hydra”
•  Producers always append to “tail” (think: append to a file)
•  Kafka prunes “head” based on age or max size or “key”

[Figure: producers A1, A2, …, An append new messages to a Kafka topic on the broker(s); messages run from older to newer.]
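The append-to-tail and prune-from-head behaviour can be illustrated with a toy in-memory log. This is a sketch only: the class name `TopicLog` and the `max_messages` limit are hypothetical, and it models size-based pruning alone (not age- or key-based pruning).

```python
from collections import deque


class TopicLog:
    """Toy append-only log: new messages join the tail, old ones leave from the head."""

    def __init__(self, max_messages):
        self.max_messages = max_messages  # hypothetical size-based retention limit
        self.messages = deque()
        self.base_offset = 0              # offset of the oldest retained message

    def append(self, message):
        """Append to the tail and return the offset assigned to the message."""
        offset = self.base_offset + len(self.messages)
        self.messages.append(message)
        self._prune()
        return offset

    def _prune(self):
        # Drop messages from the head until the log fits the retention limit.
        while len(self.messages) > self.max_messages:
            self.messages.popleft()
            self.base_offset += 1

    def read(self, offset):
        """Return the message at `offset`, or None if it was pruned or not yet written."""
        index = offset - self.base_offset
        if 0 <= index < len(self.messages):
            return self.messages[index]
        return None


log = TopicLog(max_messages=3)
offsets = [log.append(f"event-{i}") for i in range(5)]
```

Offsets keep growing even after the head is pruned, so a reader asking for a pruned offset gets nothing back rather than a different message.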

Topics

•  Producers always append to “tail” (think: append to a file)
•  Consumers use an “offset pointer” to track/control their read progress (and decide the pace of consumption)

[Figure: producers A1, A2, …, An append new messages to the topic on the broker(s); consumer groups C1 and C2 each read at their own position between the older and newer messages.]
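A small sketch of independent offset pointers, assuming a plain Python list as the topic. The class `GroupReader` and its `poll` and `seek` methods are hypothetical names for illustration and are not the Kafka client API.

```python
class GroupReader:
    """Toy consumer group that keeps its own offset pointer into a shared log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.offset = 0  # next position to read

    def poll(self, max_messages):
        """Read up to `max_messages` from the current offset and advance the pointer."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move the pointer, for example to re-read older messages."""
        self.offset = offset


topic = ["m0", "m1", "m2", "m3", "m4"]
c1 = GroupReader("C1", topic)
c2 = GroupReader("C2", topic)

c1_batch = c1.poll(4)   # C1 reads quickly
c2_batch = c2.poll(1)   # C2 reads slowly; C1's progress does not affect it
c1.seek(1)              # C1 rewinds its own pointer
replay = c1.poll(2)
```

Because the log itself is never modified by reading, each group decides its own pace and can rewind without disturbing the other.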

Partitions

•  A topic consists of partitions.
•  Partition: ordered + immutable sequence of messages that is continually appended to

Partitions

•  #partitions of a topic is configurable
•  #partitions determines max consumer (group) parallelism
    –  cf. parallelism of Storm’s KafkaSpout via builder.setSpout(,,N)
    –  Consumer group A, with 2 consumers, reads from a 4-partition topic
    –  Consumer group B, with 4 consumers, reads from the same topic
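The two consumer groups above can be worked through with a simple round-robin rule. This is an illustrative assumption: the slides do not say how Kafka assigns partitions, and the function `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Group A: 2 consumers on a 4-partition topic, each reads 2 partitions.
group_a = assign_partitions(4, ["a1", "a2"])

# Group B: 4 consumers on the same topic, each reads 1 partition.
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])

# A hypothetical group with 6 consumers: 2 of them get nothing to read,
# which is why #partitions caps the useful parallelism of a group.
oversized = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [consumer for consumer, parts in oversized.items() if not parts]
```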

Partition offsets

•  Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
    –  Consumers track their pointers via (offset, partition, topic) tuples

[Figure: consumer group C1]
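The (offset, partition, topic) tuples can be sketched as a named tuple. The names `Position` and `advance` are hypothetical, and the topic name is borrowed from the earlier “zerg.hydra” example; this is not how a Kafka client stores its positions.

```python
from typing import NamedTuple


class Position(NamedTuple):
    """A consumer's read pointer: offsets are only unique within one partition."""
    offset: int
    partition: int
    topic: str


def advance(position, consumed):
    """Return a new pointer moved forward by the number of messages consumed."""
    return position._replace(offset=position.offset + consumed)


# One pointer per partition of the topic; both start at offset 0,
# which is a different message in each partition.
positions = {
    0: Position(offset=0, partition=0, topic="zerg.hydra"),
    1: Position(offset=0, partition=1, topic="zerg.hydra"),
}

positions[0] = advance(positions[0], 3)  # consumed 3 messages from partition 0
positions[1] = advance(positions[1], 1)  # consumed 1 message from partition 1
```

The offset alone is ambiguous, so the partition and topic have to travel with it.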

Replicas of a partition

•  Replicas: “backups” of a partition
    –  They exist solely to prevent data loss.
    –  Replicas are never read from, never written to.
        •  They do NOT help to increase producer or consumer parallelism!
    –  Kafka tolerates (numReplicas - 1) dead brokers before losing data
        •  LinkedIn: numReplicas == 2 → 1 broker can die
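The (numReplicas - 1) rule can be checked with a short sketch. The function names and broker names are hypothetical; the model only assumes that a partition's data survives as long as at least one broker holding a copy is alive.

```python
def tolerated_broker_failures(num_replicas):
    """Number of brokers holding a partition that can die before data is lost."""
    if num_replicas < 1:
        raise ValueError("a partition needs at least one copy")
    return num_replicas - 1


def data_survives(replica_brokers, dead_brokers):
    """True if at least one broker holding a copy of the partition is still alive."""
    return any(broker not in dead_brokers for broker in replica_brokers)


# The LinkedIn setting from the slide: numReplicas == 2.
copies = ["broker-1", "broker-2"]
one_dead = data_survives(copies, dead_brokers={"broker-1"})
both_dead = data_survives(copies, dead_brokers={"broker-1", "broker-2"})
```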

Kafka Quickstart

•  Steps for downloading Kafka, starting a server, and creating a console-based consumer/producer
•  Requires ZooKeeper to be installed and running
•  https://kafka.apache.org/documentation.html#quickstart
•  https://github.com/adamjshook/hadoop-demos/tree/master/kafka
