Storm overview & integration

Buckle up, Dorothy!

Distributed real-time computation

By Nathan Marz

BackType => Twitter => Apache

WHAT IS IT GOOD FOR?

Real-time analytics
Online machine learning
Continuous computation
Distributed RPC
ETL (Extract, Transform, Load)
…

PROMISES

No data loss
Fault-tolerant
Scalable
Robust

VIEW FROM ABOVE

Source (Kafka, *MQ, …) -> Stream -> Topology (Storm Cluster) -> Storage (Read/Write)

PRIMITIVES

Tuple: Field 1 / Value 1, Field 2 / Value 2, Field 3 / Value 3, Field 4 / Value 4, Field 5 / Value 5

Stream: Tuple, Tuple, Tuple, Tuple

PRIMITIVES

Spout
Bolt
Topology

PRIMITIVES

Tuples
Spouts
Bolts

ABSTRACTION

Filters
Functions
Joins
Chaining streams
Small components

EFFECTS

Transformation
Incremental
Distributed
Scalable

CLUSTER

Nimbus
Zookeeper Cluster
Worker Node (x3), each with: Supervisor, Executor, Executor

CLUSTER

NIMBUS / NODES

Small
No state
Robust
Kill / Restart easy

ZOOKEEPER

Communication
State

AS PROMISED?

No data loss
Fault-tolerant
Scalable
Robust

GUARANTEES

Message transforms into a tuple tree
Storm tracks tuple tree
Fully processed when tree exhausted
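The tuple-tree bookkeeping can be illustrated with a small, simplified model. The sketch below is not Storm's acker code; the class `TupleTreeTracker` and its methods are hypothetical names. It assumes every tuple has a random 64-bit id and keeps one running XOR value per root tuple: each id is XORed in once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when the tree is exhausted.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of tuple-tree tracking (hypothetical, not Storm's acker). */
final class TupleTreeTracker {
    // One running XOR value per root (spout) tuple.
    private final Map<Long, Long> pending = new HashMap<>();

    /** The spout emitted a root tuple: start tracking its tree. */
    void emitRoot(long rootId, long tupleId) {
        pending.put(rootId, tupleId);
    }

    /**
     * A bolt acked the tuple ackedId after emitting the tuples in childIds,
     * all anchored to the same root. Returns true once the tree is exhausted.
     */
    boolean ack(long rootId, long ackedId, long... childIds) {
        Long current = pending.get(rootId);
        if (current == null) {
            return false; // unknown root, or tree already completed
        }
        long value = current ^ ackedId;
        for (long child : childIds) {
            value ^= child; // new links in the tree
        }
        if (value == 0L) {
            pending.remove(rootId); // every emitted tuple has been acked
            return true;
        }
        pending.put(rootId, value);
        return false;
    }

    boolean isPending(long rootId) {
        return pending.containsKey(rootId);
    }
}
```

A root whose entry stays pending past a timeout would be treated as failed and replayed from the source; the timeout handling is left out of this sketch.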

FAILURES

Task died – failed tuples are replayed
Acker task died – related tuples time out and are replayed
Spout task died – source replays, e.g. pending messages are placed back on the queue

WHAT DO I HAVE TO DO?

Inform about new links in tree
Inform when finished with a tuple
Every tuple must be acked or failed

ANYTHING SIMPLER?

TRIDENT

High-level abstraction
Stateful persistence primitives
Exactly-once semantics
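One common way to get exactly-once updates on top of replayed batches is to store the id of the last applied batch next to the value. The sketch below is a plain-Java illustration of that idea, not Trident's API; `TransactionalCounter` is a hypothetical name. It assumes batches are applied in order and that a replayed batch carries the same transaction id and the same contents as the original.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a transactional counter: value plus last applied txid. */
final class TransactionalCounter {
    private static final class Entry {
        long value;
        long txid;

        Entry(long value, long txid) {
            this.value = value;
            this.txid = txid;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Applies a batch delta for a key unless this txid was already applied. */
    void applyBatch(long txid, String key, long delta) {
        Entry entry = store.get(key);
        if (entry == null) {
            store.put(key, new Entry(delta, txid));
            return;
        }
        if (entry.txid == txid) {
            return; // replay of a batch that was already applied: skip it
        }
        entry.value += delta;
        entry.txid = txid;
    }

    long get(String key) {
        Entry entry = store.get(key);
        return entry == null ? 0L : entry.value;
    }
}
```

Replaying batch 1 after a failure leaves the stored value unchanged, while batch 2 is added on top as usual.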

AS PROMISED?

USER DASHBOARD

PROBLEM
Bad performance
Uses core storage

Pre-compute
Customize
Fast
Isolate
Quarterly agg.

ARCHITECTURE

Events -> Queue (4 Partitions, 2 Replicas) -> 4 Workers (Pull, Write) -> MS SQL (4 Staging) -> Dashboard

State in source

Topic: 9 8 7 6 5 4 3 2 1

Client

Stacked
Flushed
Client offset
Replicated
Partitioned
Fast
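The "client offset" idea can be sketched with an in-memory log: the log only appends, and each client remembers how far it has read. This is a toy model, not the Kafka client API; `PartitionLog` and `OffsetClient` are hypothetical names. Because the read position lives with the client, replaying after a failure amounts to rewinding the offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only log standing in for one topic partition. */
final class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message and returns its offset. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to max messages starting at the given offset. */
    List<String> fetch(long offset, int max) {
        int from = (int) Math.min(Math.max(offset, 0L), (long) messages.size());
        int to = Math.min(from + max, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }
}

/** Hypothetical client that owns its read position. */
final class OffsetClient {
    private final PartitionLog log;
    private long offset = 0L;

    OffsetClient(PartitionLog log) {
        this.log = log;
    }

    /** Reads the next batch and advances the client-side offset. */
    List<String> poll(int max) {
        List<String> batch = log.fetch(offset, max);
        offset += batch.size();
        return batch;
    }

    /** Rewinds (or skips) to an offset, e.g. to replay after a failure. */
    void rewindTo(long newOffset) {
        offset = newOffset;
    }

    long offset() {
        return offset;
    }
}
```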

TRANSFORMATION

ORIGINAL
{
  id: df45er87c78df,
  sender: "Info",
  destination: "39345123456",
  parts: 2,
  price: 100,
  client: "Demo",
  time: "2014-06-02 14:47:58",
  country: "IT",
  network: "Wind",
  type: "SMS",
  …
}

COMPUTED
{
  client: "Demo",
  type: "SMS",
  country: "IT",
  network: "Wind",
  bucket: "2014-06-02 14:45:00",
  traffic: 2,
  expenses: 200
}
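The step from the original event to the computed record can be sketched as follows. The class and method names are hypothetical, and the rules are inferred from the single example above: the bucket is the event time rounded down to the quarter hour, traffic equals parts, and expenses equals parts times price (2 x 100 = 200).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper reproducing the example transformation. */
final class QuarterBucketSketch {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Rounds a timestamp down to the start of its 15-minute bucket. */
    static String bucketOf(String time) {
        LocalDateTime parsed = LocalDateTime.parse(time, FORMAT);
        int quarterMinute = (parsed.getMinute() / 15) * 15;
        return parsed.truncatedTo(ChronoUnit.HOURS)
                .plusMinutes(quarterMinute)
                .format(FORMAT);
    }

    /** Assumption: traffic counts message parts. */
    static long traffic(long parts) {
        return parts;
    }

    /** Assumption: price is per part, so expenses = parts * price. */
    static long expenses(long parts, long price) {
        return parts * price;
    }
}
```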

TridentState tridentState = topology
    // Read raw events from Kafka.
    .newStream("CoreEvents", buildKafkaSpout())
    .parallelismHint(4)
    // Parse the raw bytes into named fields.
    .each(
        new Fields("bytes"),
        new CoreEventMessageParser(),
        new Fields("time", "client", "network", "country", "type", "parts", "price"))
    // Derive the time bucket from the event time.
    .each(
        new Fields("time"),
        new QuarterTimeBucket(),
        new Fields("bucket"))
    .project(new Fields("bucket", "client", "network", "country", "type", "traffic", "expenses"))
    // Aggregate per bucket and dimension, and persist the result.
    .groupBy(new Fields("bucket", "client", "network", "country", "type"))
    .persistentAggregate(
        getStateFactory(),
        new Fields("traffic", "expenses"),
        new Sum(),
        new Fields("trafficExpenses"))
    .parallelismHint(8);

PERFORMANCE

[Diagram: throughput figures 60.000, 4.500, 160.000, 2.000 and 10.000 across REGULAR, KAFKA, STORAGE and DASHBOARD]

TUNING STORAGE

1st Issue - Storage
Random access – 1.500 writes/s limit
Staged approach – 30.000 writes/s limit

No locks – isolated
Scalable – each worker has its own stage
Main table indexes nicely
Doesn’t affect reading

STAGED WRITES

Worker 1 -> Write -> Stage Table 1 -> Merge -> Main Table
Worker 2 -> Write -> Stage Table 2 -> Merge -> Main Table
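The staged-write flow can be modelled in memory to show why workers do not contend with each other. This is only a sketch with hypothetical names (`StagedWrites`, `write`, `merge`, `read`); the real setup uses SQL tables, and the assumption here is that merging adds staged aggregates to the matching rows of the main table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model: one stage per worker, merged into a main table. */
final class StagedWrites {
    private final Map<String, Long> mainTable = new HashMap<>();
    private final Map<Integer, Map<String, Long>> stages = new HashMap<>();

    /** A worker writes only to its own stage, so workers never share rows. */
    void write(int worker, String key, long amount) {
        stages.computeIfAbsent(worker, w -> new HashMap<>())
                .merge(key, amount, Long::sum);
    }

    /** Moves one worker's staged rows into the main table and clears the stage. */
    void merge(int worker) {
        Map<String, Long> stage = stages.remove(worker);
        if (stage == null) {
            return; // nothing staged for this worker
        }
        stage.forEach((key, amount) -> mainTable.merge(key, amount, Long::sum));
    }

    /** Readers see the main table only; staged rows are not yet visible. */
    long read(String key) {
        return mainTable.getOrDefault(key, 0L);
    }
}
```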

TUNING TOPOLOGY

2nd Issue - Serialization

[Chart: Raw/s, Expanded/s and Writes/s (axis 0 to 160,000) for 200 KB, 1 MB, 4 MB, 8 MB, 16 MB and 24 MB; annotation: Plateauing]

SERIALIZATION

[Chart: S [s], S [byte], S [% CPU], D [s] and D [% CPU] for CSV (Plain), CSV (Deflate), CSV (GZip), Jackson (Plain), Jackson (GZip), Jackson Smile, Java Object and Kryo]

MEASURE

AXIS
Max spout pending
SQL workers
DB batch size
Kafka fetch size
Serialization
…

METRICS
Kafka fetch speed
DB write speed
Kafka / DB ratio
Capacity
Latency

MONITOR

STORM UI TOPOLOGY

METRICS

GRAPHITE

GOTCHAS

Version 0.9.1
Partially in flux
Kafka integration
Message & topology versioning
Performance tuning

Lambda Architecture

New Data -> Batch Layer (Master Dataset) -> Serving Layer (Batch Views)
New Data -> Speed Layer (Real-time Views)
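How a query combines the two kinds of views can be sketched in a few lines. The example is a hypothetical illustration: it assumes both views hold additive counters per key, and that the real-time view covers only data not yet absorbed into the batch view.

```java
import java.util.Map;

/** Hypothetical query-side merge of a batch view and a real-time view. */
final class LambdaQuery {
    /** Answers a counter query by adding the batch and real-time values. */
    static long query(Map<String, Long> batchView,
                      Map<String, Long> realtimeView,
                      String key) {
        return batchView.getOrDefault(key, 0L)
                + realtimeView.getOrDefault(key, 0L);
    }
}
```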

RESOURCES

http://storm.incubator.apache.org
http://lambda-architecture.net
http://kafka.apache.org

PRESENTATION TOOLS

http://www.gimp.org
http://www.pictaculous.com
http://www.colourlovers.com
http://www.easycalculation.com
http://paletton.com

QUESTIONS?
