
Storm overview & integration


Page 1: Storm overview & integration

STORM

Buckle up Dorothy !!!

Page 2

ABOUT

Distributed real-time computation
By Nathan Marz (BackType => Twitter => Apache)

Page 3

WHAT IS IT GOOD FOR?

Real-time analytics
Online machine learning
Continuous computation
Distributed RPC
ETL (Extract, Transform, Load)
…

Page 4

PROMISES

No data loss
Fault-tolerant
Scalable
Robust

Page 5

VIEW FROM ABOVE

A stream source (Kafka, *MQ, …) feeds a Storm cluster: the cluster pulls messages, runs them through a topology, and reads/writes results to storage.

Page 6

PRIMITIVES

Tuple: a named list of fields (Field 1 / Value 1, Field 2 / Value 2, … Field 5 / Value 5)
Stream: an unbounded sequence of tuples (Tuple Tuple Tuple Tuple …)
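A tuple can be pictured as no more than a named list of fields. The toy sketch below (plain Java, not Storm's actual Tuple class) shows the idea:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the tuple primitive: a named list of fields.
// A stream is then just an unbounded sequence of such tuples.
public class TupleSketch {
    public static Map<String, Object> tuple(List<String> fields, Object... values) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) {
            t.put(fields.get(i), values[i]); // pair each field name with its value
        }
        return t;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("client", "type", "country");
        Map<String, Object> t = tuple(fields, "Demo", "SMS", "IT");
        System.out.println(t.get("type")); // SMS
    }
}
```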

Page 7

PRIMITIVES

Topology: a graph wiring spouts to bolts. Spouts emit tuples (T) into streams; bolts consume those streams, transform the tuples, and emit new streams that further bolts can consume.

Page 8

ABSTRACTION

PRIMITIVES
Tuples
Spouts
Bolts
Filters
Functions
Joins

EFFECTS
Transformation
Incremental
Distributed
Scalable
Chaining streams
Small components

Page 9

CLUSTER

Nimbus coordinates the cluster through a ZooKeeper cluster. Each worker node runs a Supervisor, which manages several Executors.

Page 10

CLUSTER: NIMBUS / NODES

Small
No state
Kill / restart easy

ZOOKEEPER

Communication
State
Robust

Page 11

AS PROMISED?

No data loss
Fault-tolerant
Scalable
Robust

Page 12

GUARANTEES

A message transforms into a tuple tree
Storm tracks the tuple tree
Fully processed when the tree is exhausted
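Storm's tuple-tree tracking is built on an XOR trick in its acker: every tuple id is XORed into a running value once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when the tree is exhausted. A minimal, Storm-free sketch of that idea (not Storm's API):

```java
import java.util.Random;

// Minimal sketch of Storm's acker idea: each tuple in the tree gets a
// random 64-bit id; the acker XORs an id in when the tuple is emitted and
// again when it is acked. The running XOR is zero exactly when every
// emitted tuple has been acked, i.e. the tree is fully processed.
public class AckerSketch {
    private long ackVal = 0;

    public void emitted(long tupleId) { ackVal ^= tupleId; }
    public void acked(long tupleId)   { ackVal ^= tupleId; }
    public boolean fullyProcessed()   { return ackVal == 0; }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random(42);
        long root = rnd.nextLong(), childA = rnd.nextLong(), childB = rnd.nextLong();

        acker.emitted(root);   // spout emits the root tuple
        acker.emitted(childA); // a bolt anchors two child tuples to it
        acker.emitted(childB);
        acker.acked(root);     // root acked, children still pending
        System.out.println(acker.fullyProcessed()); // false
        acker.acked(childA);
        acker.acked(childB);
        System.out.println(acker.fullyProcessed()); // true
    }
}
```

The payoff is that tracking an arbitrarily large tree costs a single long per root tuple, regardless of how many tuples the tree contains.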

Page 13

FAILURES

Task died – failed tuples are replayed
Acker task died – related tuples time out and are replayed
Spout task died – the source replays; e.g. pending messages are placed back on the queue

Page 14

WHAT DO I HAVE TO DO?

Inform about new links in the tree
Inform when finished with a tuple
Every tuple must be acked or failed

Page 15

ANYTHING SIMPLER?

TRIDENT
High-level abstraction
Stateful persistence primitives
Exactly-once semantics

Page 16

AS PROMISED?

YES

Page 17

USER DASHBOARD

PROBLEM
Bad performance
Uses core storage

IDEA
Pre-compute
Customize
Fast
Isolate
Quarter-hour aggregation

Page 18

ARCHITECTURE

Core pushes events to a Kafka queue (4 partitions, 2 replicas). Storm (4 workers) pulls from the queue and writes to MS SQL (4 staging tables); the dashboard reads from SQL. State is kept in the source.

Page 19

KAFKA

A fast, partitioned, replicated topic log: new messages are stacked at the head (positions 9 … 1, new to old) and old ones are eventually flushed; each client tracks its own offset.
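The client-offset idea can be modeled in a few lines (a toy, not Kafka's API): the broker keeps an append-only log per partition, and a consumer only remembers the offset it wants to read from, so it can replay from any earlier position:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Kafka's consumer-offset idea: an append-only log on the
// broker side; the client owns its read position and can rewind at will.
public class OffsetLog {
    private final List<String> log = new ArrayList<>();

    // Producer side: append at the head of the log.
    public void append(String msg) { log.add(msg); }

    // Consumer side: read everything from a client-held offset onward.
    public List<String> readFrom(int offset) {
        return log.subList(offset, log.size());
    }
}
```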

Page 20

TRANSFORMATION

ORIGINAL
{ id: "df45er87c78df", sender: "Info", destination: "39345123456", parts: 2, price: 100, client: "Demo", time: "2014-06-02 14:47:58", country: "IT", network: "Wind", type: "SMS", … }

COMPUTED
{ client: "Demo", type: "SMS", country: "IT", network: "Wind", bucket: "2014-06-02 14:45:00", traffic: 2, expenses: 200 }
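The bucketing step above (14:47:58 → 14:45:00) is a plain quarter-hour floor. A self-contained helper in the spirit of the topology's QuarterTimeBucket function (the name comes from the code slide; this implementation is an assumption):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for the QuarterTimeBucket step: floor an event
// timestamp to the start of its 15-minute bucket.
public class QuarterBucket {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static String bucket(String time) {
        LocalDateTime t = LocalDateTime.parse(time, FMT);
        // Round the minutes down to a multiple of 15 and zero the seconds.
        return t.withMinute((t.getMinute() / 15) * 15).withSecond(0).format(FMT);
    }

    public static void main(String[] args) {
        System.out.println(bucket("2014-06-02 14:47:58")); // 2014-06-02 14:45:00
    }
}
```

Grouping by this bucket (plus client, network, country, and type) is what turns the raw event stream into the pre-computed dashboard rows.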

Page 21

CODE

TridentState tridentState = topology
    .newStream("CoreEvents", buildKafkaSpout())
    .parallelismHint(4)
    .each(
        new Fields("bytes"),
        new CoreEventMessageParser(),
        new Fields("time", "client", "network", "country", "type", "parts", "price"))
    .each(
        new Fields("time"),
        new QuarterTimeBucket(),
        new Fields("bucket"))
    .project(new Fields("bucket", "client", "network", "country", "type", "traffic", "expenses"))
    .groupBy(new Fields("bucket", "client", "network", "country", "type"))
    .persistentAggregate(
        getStateFactory(),
        new Fields("traffic", "expenses"),
        new Sum(),
        new Fields("trafficExpenses"))
    .parallelismHint(8);

Page 22

PERFORMANCE

Throughput per second, REGULAR / PEAK:
KAFKA      1.500 / 60.000
STORM      4.500 / 160.000
STORAGE    2.000 / 10.000
DASHBOARD  1 / 1

Page 23

TUNING STORAGE

1st issue – storage
Random access – 1.500 w/s limit
Staged approach – 30.000 w/s limit

No locks – isolated
Scalable – each worker has its own stage
Main table indexes nicely
Doesn’t affect reading

Page 24

STAGED WRITES

Worker 1 writes to Stage Table 1 and Worker 2 writes to Stage Table 2; each stage table is then merged into the Main Table.
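The pattern can be modeled in memory: each worker owns a stage it can write without locks, and a merge pass folds the stages into the main table sequentially. A toy sketch, with maps standing in for the SQL tables:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the staged-write pattern: per-worker stages take the
// high-rate writes contention-free; a periodic merge folds them into the
// main table in one pass, so reads and indexes on the main table stay fast.
public class StagedWrites {
    private final Map<String, Long> mainTable = new HashMap<>();
    private final List<Map<String, Long>> stages = new ArrayList<>();

    public StagedWrites(int workers) {
        for (int i = 0; i < workers; i++) stages.add(new HashMap<>());
    }

    // Each worker writes only to its own stage: isolated, no locks needed.
    public void write(int worker, String key, long delta) {
        stages.get(worker).merge(key, delta, Long::sum);
    }

    // Merge step: fold every stage into the main table, then clear it.
    public void mergeAll() {
        for (Map<String, Long> stage : stages) {
            stage.forEach((k, v) -> mainTable.merge(k, v, Long::sum));
            stage.clear();
        }
    }

    public Long get(String key) { return mainTable.get(key); }
}
```

In the real system the merge is a SQL statement per stage table, but the shape is the same: random-access writes become append-then-merge.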

Page 25

TUNING TOPOLOGY

2nd issue – serialization
Chart: raw/s, expanded/s, and writes/s (0–160,000) against Kafka fetch sizes from 200 KB to 24 MB; throughput plateaus at the larger fetch sizes.

Page 26

SERIALIZATION

Chart: serialization time [s], serialized size [byte], serialization CPU [%], deserialization time [s], and deserialization CPU [%] for CSV (Plain, Deflate, GZip), Jackson (Plain, GZip, Smile), Java Object, and Kryo.
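As a self-contained taste of such a comparison, the sketch below measures how many bytes one event takes under built-in Java serialization versus a plain CSV line (Kryo and Jackson, also covered by the chart, need external libraries and are omitted here):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Compare the wire size of one event under built-in Java serialization
// versus a plain CSV line. Java serialization carries class metadata per
// object, which is one reason compact formats win in the chart above.
public class SerializationSize {
    public record Event(String client, String type, String country,
                        int traffic, long expenses) implements Serializable {}

    public static int javaBytes(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static int csvBytes(Event e) {
        String line = String.join(",", e.client(), e.type(), e.country(),
                Integer.toString(e.traffic()), Long.toString(e.expenses()));
        return line.getBytes().length;
    }

    public static void main(String[] args) {
        Event e = new Event("Demo", "SMS", "IT", 2, 200);
        System.out.println("Java: " + javaBytes(e) + " bytes, CSV: " + csvBytes(e) + " bytes");
    }
}
```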

Page 27

MEASURE

AXES
Max spout pending
SQL workers
DB batch size
Kafka fetch size

METRICS
Kafka fetch speed
DB write speed
Kafka / DB ratio
Capacity
Latency
Serialization
…

Page 28

MONITOR

STORM UI TOPOLOGY

Page 29

METRICS

GRAPHITE

Page 30

GOTCHAS

Version 0.9.1
Partially in flux
Kafka integration
Message & topology versioning
Performance tuning

Page 31

NEXT?

Lambda Architecture: new data feeds both the batch layer (master dataset, recomputed into batch views served by the serving layer) and the speed layer (real-time views); queries merge results from the batch views and the real-time views.

Page 32

RESOURCES

http://storm.incubator.apache.org
http://lambda-architecture.net
http://kafka.apache.org

Page 33

PRESENTATION TOOLS

http://www.gimp.org
http://www.pictaculous.com
http://www.colourlovers.com
http://www.easycalculation.com
http://paletton.com

Page 34

QUESTIONS?