
Storm overview & integration


Page 1: Storm overview & integration

STORM

Buckle up Dorothy !!!

Page 2

ABOUT

Distributed real-time computation
By Nathan Marz (BackType => Twitter => Apache)

Page 3

WHAT IS IT GOOD FOR?

Real-time analytics
Online machine learning
Continuous computation
Distributed RPC
ETL (Extract, Transform, Load)
…

Page 4

PROMISES

No data loss
Fault-tolerant
Scalable
Robust

Page 5

VIEW FROM ABOVE

A stream source (Kafka, *MQ, …) feeds a Storm cluster: the cluster pulls messages, runs them through a topology, and reads/writes results to storage.

Page 6

PRIMITIVES

Tuple: a named list of fields (Field 1 / Value 1, Field 2 / Value 2, … Field 5 / Value 5)
Stream: an unbounded sequence of tuples (Tuple Tuple Tuple Tuple …)
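A tuple can be pictured as no more than a named list of fields. The toy sketch below (plain Java, not Storm's actual Tuple class) shows the idea:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the tuple primitive: a named list of fields.
// A stream is then just an unbounded sequence of such tuples.
public class TupleSketch {
    public static Map<String, Object> tuple(List<String> fields, Object... values) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) {
            t.put(fields.get(i), values[i]); // pair each field name with its value
        }
        return t;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("client", "type", "country");
        Map<String, Object> t = tuple(fields, "Demo", "SMS", "IT");
        System.out.println(t.get("type")); // SMS
    }
}
```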

Page 7

PRIMITIVES

Topology: a graph wiring spouts to bolts. Spouts emit tuples (T) into streams; bolts consume those streams, transform the tuples, and emit new streams that further bolts can consume.

Page 8

ABSTRACTION

PRIMITIVES
Tuples
Spouts
Bolts
Filters
Functions
Joins

EFFECTS
Transformation
Incremental
Distributed
Scalable
Chaining streams
Small components

Page 9

CLUSTER

Nimbus coordinates the cluster through a ZooKeeper cluster. Each worker node runs a Supervisor, which manages several Executors.

Page 10

CLUSTER: NIMBUS / NODES

Small
No state
Kill / restart easy

ZOOKEEPER

Communication
State
Robust

Page 11

AS PROMISED?

No data loss
Fault-tolerant
Scalable
Robust

Page 12

GUARANTEES

A message transforms into a tuple tree
Storm tracks the tuple tree
Fully processed when the tree is exhausted
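Storm's tuple-tree tracking is built on an XOR trick in its acker: every tuple id is XORed into a running value once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when the tree is exhausted. A minimal, Storm-free sketch of that idea (not Storm's API):

```java
import java.util.Random;

// Minimal sketch of Storm's acker idea: each tuple in the tree gets a
// random 64-bit id; the acker XORs an id in when the tuple is emitted and
// again when it is acked. The running XOR is zero exactly when every
// emitted tuple has been acked, i.e. the tree is fully processed.
public class AckerSketch {
    private long ackVal = 0;

    public void emitted(long tupleId) { ackVal ^= tupleId; }
    public void acked(long tupleId)   { ackVal ^= tupleId; }
    public boolean fullyProcessed()   { return ackVal == 0; }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random(42);
        long root = rnd.nextLong(), childA = rnd.nextLong(), childB = rnd.nextLong();

        acker.emitted(root);   // spout emits the root tuple
        acker.emitted(childA); // a bolt anchors two child tuples to it
        acker.emitted(childB);
        acker.acked(root);     // root acked, children still pending
        System.out.println(acker.fullyProcessed()); // false
        acker.acked(childA);
        acker.acked(childB);
        System.out.println(acker.fullyProcessed()); // true
    }
}
```

The payoff is that tracking an arbitrarily large tree costs a single long per root tuple, regardless of how many tuples the tree contains.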

Page 13

FAILURES

Task died – failed tuples are replayed
Acker task died – related tuples time out and are replayed
Spout task died – the source replays; e.g. pending messages are placed back on the queue

Page 14

WHAT DO I HAVE TO DO?

Inform about new links in the tree
Inform when finished with a tuple
Every tuple must be acked or failed

Page 15

ANYTHING SIMPLER?

TRIDENT
High-level abstraction
Stateful persistence primitives
Exactly-once semantics

Page 16

AS PROMISED?

YES

Page 17

USER DASHBOARD

PROBLEM
Bad performance
Uses core storage

IDEA
Pre-compute
Customize
Fast
Isolate
Quarter-hour aggregation

Page 18

ARCHITECTURE

Core pushes events to a Kafka queue (4 partitions, 2 replicas). Storm (4 workers) pulls from the queue and writes to MS SQL (4 staging tables); the dashboard reads from SQL. State is kept in the source.

Page 19

KAFKA

A fast, partitioned, replicated topic log: new messages are stacked at the head (positions 9 … 1, new to old) and old ones are eventually flushed; each client tracks its own offset.
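The client-offset idea can be modeled in a few lines (a toy, not Kafka's API): the broker keeps an append-only log per partition, and a consumer only remembers the offset it wants to read from, so it can replay from any earlier position:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Kafka's consumer-offset idea: an append-only log on the
// broker side; the client owns its read position and can rewind at will.
public class OffsetLog {
    private final List<String> log = new ArrayList<>();

    // Producer side: append at the head of the log.
    public void append(String msg) { log.add(msg); }

    // Consumer side: read everything from a client-held offset onward.
    public List<String> readFrom(int offset) {
        return log.subList(offset, log.size());
    }
}
```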

Page 20

TRANSFORMATION

ORIGINAL
{ id: "df45er87c78df", sender: "Info", destination: "39345123456", parts: 2, price: 100, client: "Demo", time: "2014-06-02 14:47:58", country: "IT", network: "Wind", type: "SMS", … }

COMPUTED
{ client: "Demo", type: "SMS", country: "IT", network: "Wind", bucket: "2014-06-02 14:45:00", traffic: 2, expenses: 200 }
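The bucketing step above (14:47:58 → 14:45:00) is a plain quarter-hour floor. A self-contained helper in the spirit of the topology's QuarterTimeBucket function (the name comes from the code slide; this implementation is an assumption):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for the QuarterTimeBucket step: floor an event
// timestamp to the start of its 15-minute bucket.
public class QuarterBucket {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static String bucket(String time) {
        LocalDateTime t = LocalDateTime.parse(time, FMT);
        // Round the minutes down to a multiple of 15 and zero the seconds.
        return t.withMinute((t.getMinute() / 15) * 15).withSecond(0).format(FMT);
    }

    public static void main(String[] args) {
        System.out.println(bucket("2014-06-02 14:47:58")); // 2014-06-02 14:45:00
    }
}
```

Grouping by this bucket (plus client, network, country, and type) is what turns the raw event stream into the pre-computed dashboard rows.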

Page 21

CODE

TridentState tridentState = topology
    .newStream("CoreEvents", buildKafkaSpout())
    .parallelismHint(4)
    .each(
        new Fields("bytes"),
        new CoreEventMessageParser(),
        new Fields("time", "client", "network", "country", "type", "parts", "price"))
    .each(
        new Fields("time"),
        new QuarterTimeBucket(),
        new Fields("bucket"))
    .project(new Fields("bucket", "client", "network", "country", "type", "traffic", "expenses"))
    .groupBy(new Fields("bucket", "client", "network", "country", "type"))
    .persistentAggregate(
        getStateFactory(),
        new Fields("traffic", "expenses"),
        new Sum(),
        new Fields("trafficExpenses"))
    .parallelismHint(8);

Page 22

PERFORMANCE

Throughput per second, REGULAR / PEAK:
KAFKA      1.500 / 60.000
STORM      4.500 / 160.000
STORAGE    2.000 / 10.000
DASHBOARD  1 / 1

Page 23

TUNING STORAGE

1st issue – storage
Random access – 1.500 w/s limit
Staged approach – 30.000 w/s limit

No locks – isolated
Scalable – each worker has its own stage
Main table indexes nicely
Doesn’t affect reading

Page 24

STAGED WRITES

Worker 1 writes to Stage Table 1 and Worker 2 writes to Stage Table 2; each stage table is then merged into the Main Table.
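The pattern can be modeled in memory: each worker owns a stage it can write without locks, and a merge pass folds the stages into the main table sequentially. A toy sketch, with maps standing in for the SQL tables:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the staged-write pattern: per-worker stages take the
// high-rate writes contention-free; a periodic merge folds them into the
// main table in one pass, so reads and indexes on the main table stay fast.
public class StagedWrites {
    private final Map<String, Long> mainTable = new HashMap<>();
    private final List<Map<String, Long>> stages = new ArrayList<>();

    public StagedWrites(int workers) {
        for (int i = 0; i < workers; i++) stages.add(new HashMap<>());
    }

    // Each worker writes only to its own stage: isolated, no locks needed.
    public void write(int worker, String key, long delta) {
        stages.get(worker).merge(key, delta, Long::sum);
    }

    // Merge step: fold every stage into the main table, then clear it.
    public void mergeAll() {
        for (Map<String, Long> stage : stages) {
            stage.forEach((k, v) -> mainTable.merge(k, v, Long::sum));
            stage.clear();
        }
    }

    public Long get(String key) { return mainTable.get(key); }
}
```

In the real system the merge is a SQL statement per stage table, but the shape is the same: random-access writes become append-then-merge.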

Page 25

TUNING TOPOLOGY

2nd issue – serialization
Chart: raw/s, expanded/s, and writes/s (0–160,000) against Kafka fetch sizes from 200 KB to 24 MB; throughput plateaus at the larger fetch sizes.

Page 26

SERIALIZATION

Chart: serialization time [s], serialized size [byte], serialization CPU [%], deserialization time [s], and deserialization CPU [%] for CSV (Plain, Deflate, GZip), Jackson (Plain, GZip, Smile), Java Object, and Kryo.
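As a self-contained taste of such a comparison, the sketch below measures how many bytes one event takes under built-in Java serialization versus a plain CSV line (Kryo and Jackson, also covered by the chart, need external libraries and are omitted here):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Compare the wire size of one event under built-in Java serialization
// versus a plain CSV line. Java serialization carries class metadata per
// object, which is one reason compact formats win in the chart above.
public class SerializationSize {
    public record Event(String client, String type, String country,
                        int traffic, long expenses) implements Serializable {}

    public static int javaBytes(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static int csvBytes(Event e) {
        String line = String.join(",", e.client(), e.type(), e.country(),
                Integer.toString(e.traffic()), Long.toString(e.expenses()));
        return line.getBytes().length;
    }

    public static void main(String[] args) {
        Event e = new Event("Demo", "SMS", "IT", 2, 200);
        System.out.println("Java: " + javaBytes(e) + " bytes, CSV: " + csvBytes(e) + " bytes");
    }
}
```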

Page 27

MEASURE

AXES
Max spout pending
SQL workers
DB batch size
Kafka fetch size

METRICS
Kafka fetch speed
DB write speed
Kafka / DB ratio
Capacity
Latency
Serialization
…

Page 28

MONITOR

STORM UI TOPOLOGY

Page 29

METRICS

GRAPHITE

Page 30

GOTCHAS

Version 0.9.1
Partially in flux
Kafka integration
Message & topology versioning
Performance tuning

Page 31

NEXT?

Lambda Architecture: new data feeds both the batch layer (master dataset, recomputed into batch views served by the serving layer) and the speed layer (real-time views); queries merge results from the batch views and the real-time views.

Page 32

RESOURCES

http://storm.incubator.apache.org
http://lambda-architecture.net
http://kafka.apache.org

Page 33

PRESENTATION TOOLS

http://www.gimp.org
http://www.pictaculous.com
http://www.colourlovers.com
http://www.easycalculation.com
http://paletton.com

Page 34

QUESTIONS?