vanja-radovanovic
STORM
Buckle up Dorothy !!!
Distributed real-time computation
ABOUT
By Nathan Marz
BackType => Twitter => Apache
WHAT IS IT GOOD FOR?
Real-time analytics
Online machine learning
Continuous computation
Distributed RPC
ETL (Extract, Transform, Load)
…
PROMISES
No data loss
Fault-tolerant
Scalable
Robust
VIEW FROM ABOVE
[Diagram: the Storm Cluster runs a Topology that pulls a Stream from a Source (Kafka, *MQ, …) and reads from / writes to Storage]
PRIMITIVES
[Diagram: a Tuple is a list of named values (Field 1 / Value 1 … Field 5 / Value 5); a Stream is a sequence of Tuples]
PRIMITIVES
[Diagram: a Topology is a graph in which Spouts emit tuples (T) that flow through a chain of Bolts]
ABSTRACTION
PRIMITIVES
Tuples
Spouts
Bolts
Filters
Functions
Joins
Transformation
Chaining streams
EFFECTS
Small components
Incremental
Distributed
Scalable
CLUSTER
[Diagram: Nimbus coordinates through a Zookeeper Cluster with three Worker Nodes; each Worker Node runs one Supervisor and several Executors]
CLUSTER
NIMBUS / NODES
Small
No state
Robust
Kill / Restart easy
ZOOKEEPER
Communication
State
AS PROMISED?
No data loss
Fault-tolerant
Scalable
Robust
GUARANTEES
Message transforms into a tuple tree
Storm tracks tuple tree
Fully processed when tree exhausted
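A minimal sketch of how a tuple tree can be tracked with constant memory per message, assuming the XOR-checksum idea that Storm's documentation describes for its acker tasks. The class and method names (`TupleTreeTracker`, `rootEmitted`, `acked`) are hypothetical and are not Storm API; real tuple ids are random 64-bit values, the small ones in the usage note are for readability only.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of XOR-based tuple-tree tracking (illustrative, not Storm's classes). */
final class TupleTreeTracker {
    // One 64-bit checksum per spout message: the XOR of every tuple id seen so far.
    private final Map<Long, Long> pending = new HashMap<>();

    /** The spout emitted the root tuple of a new message. */
    void rootEmitted(long messageId, long rootTupleId) {
        pending.put(messageId, rootTupleId);
    }

    /**
     * A bolt acked inputTupleId after emitting children anchored to it.
     * Every id is XORed in once when created and once when acked, so it cancels out.
     * Returns true when the checksum reaches zero, i.e. the tree is exhausted.
     */
    boolean acked(long messageId, long inputTupleId, long... childTupleIds) {
        long checksum = pending.get(messageId) ^ inputTupleId;
        for (long child : childTupleIds) {
            checksum ^= child;
        }
        if (checksum == 0L) {
            pending.remove(messageId);
            return true;
        }
        pending.put(messageId, checksum);
        return false;
    }

    /** True while the message still has unacked tuples in its tree. */
    boolean isPending(long messageId) {
        return pending.containsKey(messageId);
    }
}
```

Usage: after `rootEmitted(1, 0xA)`, calling `acked(1, 0xA, 0xB, 0xC)` leaves the message pending because two children are outstanding; acking `0xB` and then `0xC` with no further children brings the checksum to zero and the message counts as fully processed.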
FAILURES
Task died – failed tuples replayed
Acker task died – related tuples time out and are replayed
Spout task died – source replays, e.g. pending messages are placed back on the queue
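The replay behaviour listed above can be sketched from the source's point of view: a message stays pending until it is acked, and goes back on the queue if it fails or times out. This is a simplified single-threaded model; `ReplayingSourceSketch` and its methods are hypothetical names, message strings are assumed unique, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Toy queue-backed source that replays failed or timed-out messages. */
final class ReplayingSourceSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<String, Long> pendingSince = new HashMap<>();
    private final long timeoutMillis;

    ReplayingSourceSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void offer(String message) {
        queue.addLast(message);
    }

    /** Hands out the next message and remembers when it went out; null if the queue is empty. */
    String next(long nowMillis) {
        String message = queue.pollFirst();
        if (message != null) {
            pendingSince.put(message, nowMillis);
        }
        return message;
    }

    /** Fully processed: forget the message. */
    void ack(String message) {
        pendingSince.remove(message);
    }

    /** Explicit failure: place the message back on the queue. */
    void fail(String message) {
        if (pendingSince.remove(message) != null) {
            queue.addLast(message);
        }
    }

    /** Re-queues every message pending for at least the timeout; returns how many were replayed. */
    int replayTimedOut(long nowMillis) {
        int replayed = 0;
        Iterator<Map.Entry<String, Long>> it = pendingSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                queue.addLast(entry.getKey());
                it.remove();
                replayed++;
            }
        }
        return replayed;
    }
}
```

Replaying on timeout means a message can be processed more than once, which is why downstream updates need to tolerate duplicates.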
WHAT DO I HAVE TO DO?
Inform about new links in tree
Inform when finished with a tuple
Every tuple must be acked or failed
TRIDENT
ANYTHING SIMPLER?
High level abstraction
Stateful persistence primitives
Exactly-once semantics
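One way to get exactly-once updates on top of replays is to store the id of the last applied batch next to each value and to skip a batch that has already been applied. The sketch below illustrates that idea only; it assumes a replayed batch carries the same id and the same contents, that ids are non-negative, and that batches are committed in order. `TransactionalCounter` is a hypothetical name and not Trident API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy transactional state: each value remembers the batch that last updated it. */
final class TransactionalCounter {
    private static final class Entry {
        long txId = -1L; // no batch applied yet (ids are assumed non-negative)
        long value = 0L;
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Applies a batch's increment for a key; a replayed batch (same txId) is ignored. */
    void applyBatch(long txId, String key, long increment) {
        Entry entry = store.computeIfAbsent(key, k -> new Entry());
        if (entry.txId == txId) {
            return; // this batch was already applied to this key
        }
        entry.value += increment;
        entry.txId = txId;
    }

    long get(String key) {
        Entry entry = store.get(key);
        return entry == null ? 0L : entry.value;
    }
}
```

Applying batch 1 twice with an increment of 5 leaves the value at 5; batch 2 with an increment of 3 then raises it to 8.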
AS PROMISED?
YES
USER DASHBOARD
PROBLEM
Bad performance
Uses core storage
IDEA
Isolate
Pre-compute
Customize
Fast
Quarter-hour aggregation
ARCHITECTURE
[Diagram: Core pushes events to the Events Queue (Kafka, 4 partitions, 2 replicas); Storm (4 workers) pulls from the queue and writes to MS SQL (4 staging); the Dashboard reads from MS SQL. State in source.]
KAFKA
[Diagram: a Topic as a log of messages numbered 9 (New) down to 1 (Old), read by a Client]
Stacked
Flushed
Client offset
Replicated
Partitioned
Fast
TRANSFORMATION
ORIGINAL
{ id: df45er87c78df, sender: "Info", destination: "39345123456", parts: 2, price: 100, client: "Demo", time: "2014-06-02 14:47:58", country: "IT", network: "Wind", type: "SMS", … }
COMPUTED
{ client: "Demo", type: "SMS", country: "IT", network: "Wind", bucket: "2014-06-02 14:45:00", traffic: 2, expenses: 200 }
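The ORIGINAL to COMPUTED mapping can be sketched in plain Java. The rules used here are inferred from the single example record and are assumptions: the bucket is the event time floored to 15 minutes, traffic equals `parts`, and expenses equal `parts * price`. `QuarterBucketSketch` and its methods are hypothetical names, not the deck's actual classes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Derives the computed fields from one original event (rules inferred from the example). */
final class QuarterBucketSketch {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Floors a timestamp to the start of its 15-minute bucket. */
    static String bucket(String time) {
        LocalDateTime t = LocalDateTime.parse(time, FORMAT);
        int flooredMinute = (t.getMinute() / 15) * 15;
        return t.truncatedTo(ChronoUnit.HOURS).plusMinutes(flooredMinute).format(FORMAT);
    }

    /** Assumption: traffic counts message parts. */
    static long traffic(long parts) {
        return parts;
    }

    /** Assumption: expenses are parts multiplied by the price per part. */
    static long expenses(long parts, long price) {
        return parts * price;
    }
}
```

For the record above, `bucket("2014-06-02 14:47:58")` gives `2014-06-02 14:45:00` and `expenses(2, 100)` gives 200, matching the COMPUTED values.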
CODE
TridentState tridentState = topology
    .newStream("CoreEvents", buildKafkaSpout())
    .parallelismHint(4)
    // Parse the raw Kafka message into named fields.
    .each(
        new Fields("bytes"),
        new CoreEventMessageParser(),
        new Fields("time", "client", "network", "country", "type", "parts", "price"))
    // Derive the quarter-hour bucket from the event time.
    .each(
        new Fields("time"),
        new QuarterTimeBucket(),
        new Fields("bucket"))
    // Note: "traffic" and "expenses" are not declared by the steps shown here;
    // the step that derives them is presumably omitted from the slide.
    .project(new Fields("bucket", "client", "network", "country", "type", "traffic", "expenses"))
    .groupBy(new Fields("bucket", "client", "network", "country", "type"))
    // Aggregate per group and persist the result.
    .persistentAggregate(
        getStateFactory(),
        new Fields("traffic", "expenses"),
        new Sum(),
        new Fields("trafficExpenses"))
    .parallelismHint(8);
PERFORMANCE
              REGULAR    PEAK
KAFKA         1.500      60.000
(unlabeled)   4.500      160.000
STORAGE       2.000      10.000
DASHBOARD     1          1
TUNING STORAGE
1st Issue - Storage
Random access – 1.500 w/s limit
Staged approach – 30.000 w/s limit
No locks – isolated
Scalable – each worker has its own stage
Main table indexing nicely
Doesn’t affect reading
STAGED WRITES
[Diagram: Worker 1 writes to Stage Table 1 and Worker 2 writes to Stage Table 2; each stage table is merged into the Main Table]
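The staged-write idea can be illustrated with in-memory maps standing in for the SQL tables: each worker writes only to its own stage, and a separate merge step folds a stage into the main table that readers query. This is a single-threaded sketch under those assumptions, not the deck's implementation; `StagedWritesSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-worker stage tables merged into one main table. */
final class StagedWritesSketch {
    private final Map<String, Long> mainTable = new HashMap<>();
    private final List<Map<String, Long>> stageTables = new ArrayList<>();

    StagedWritesSketch(int workers) {
        for (int i = 0; i < workers; i++) {
            stageTables.add(new HashMap<>());
        }
    }

    /** A worker touches only its own stage table, so workers never contend with each other. */
    void write(int worker, String key, long amount) {
        stageTables.get(worker).merge(key, amount, Long::sum);
    }

    /** Folds one worker's stage table into the main table and empties the stage. */
    void merge(int worker) {
        Map<String, Long> stage = stageTables.get(worker);
        stage.forEach((key, amount) -> mainTable.merge(key, amount, Long::sum));
        stage.clear();
    }

    /** Readers see only merged data. */
    long read(String key) {
        return mainTable.getOrDefault(key, 0L);
    }
}
```

Readers see a value only after the corresponding stage has been merged, which is the trade-off for keeping writers isolated from the main table.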
TUNING TOPOLOGY
2nd Issue - Serialization
[Chart: Raw/s, Expanded/s and Writes/s (axis 0 to 160,000) at 200 KB, 1 MB, 4 MB, 8 MB, 16 MB and 24 MB; annotated "Plateauing"]
SERIALIZATION
[Chart: S [s], S [byte], S [% CPU], D [s] and D [% CPU] (axis 0 to 1,200) for CSV (Plain), CSV (Deflate), CSV (GZip), Jackson (Plain), Jackson (GZip), Jackson Smile, Java Object and Kryo]
MEASURE
AXIS
Max spout pending
SQL workers
DB batch size
Kafka fetch size
Serialization
…
METRICS
Kafka fetch speed
DB write speed
Kafka / DB ratio
Capacity
Latency
MONITOR
STORM UI TOPOLOGY
METRICS
GRAPHITE
GOTCHAS
Version 0.9.1
Partially in flux
Kafka integration
Message & topology versioning
Performance tuning
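NEXT?
Lambda Architecture
[Diagram: New Data feeds both the Batch Layer (Master Dataset) and the Speed Layer (Real-time Views); the Serving Layer holds the Batch Views; a Query reads from the Batch Views and the Real-time Views]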
RESOURCES
http://storm.incubator.apache.org
http://lambda-architecture.net
http://kafka.apache.org
PRESENTATION TOOLS
http://www.gimp.org
http://www.pictaculous.com
http://www.colourlovers.com
http://www.easycalculation.com
http://paletton.com
QUESTIONS?