Introduction to Twitter Storm


Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany.

Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?


Page 1: Introduction to Twitter Storm

Sankt Augustin, 24.-25.08.2013

Introduction to Twitter Storm

uweseiler

Page 2: About me

Big Data Nerd
Travelpirate
Photography Enthusiast
Hadoop Trainer
MongoDB Author

Page 3: About us

codecentric is a bunch of…

Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists, Performance Geeks

Join us!

Page 4: Agenda

• Why Twitter Storm?

• What is Twitter Storm?

• What to do with Twitter Storm?

Page 5: The 3 V’s of Big Data

Volume, Velocity, Variety

Page 6: Velocity

Page 7: Why Twitter Storm?

Page 8: Batch vs. Real-Time Processing

• Batch processing: data is gathered and processed as a group at one time.

• Real-time processing: data is processed as the information arrives.

Page 9: Lambda Architecture

Page 10: Bridging the gap…

• A batch workflow is too slow
• Views are out of date

[Diagram: timeline — everything up to a few hours ago has been absorbed into the batch views; the most recent few hours of data are not yet absorbed]

Page 11: Storm vs. Hadoop

Storm:
• Real-time processing
• Topologies run forever
• No SPOF
• Stateless nodes

Hadoop:
• Batch processing
• Jobs run to completion
• NameNode is a SPOF
• Stateful nodes

Both:
• Scalable
• Guarantees no data loss
• Open source

Page 12: Stream Processing

Stream processing is a paradigm for processing large volumes of unbounded sequences of tuples in real time.

Typical applications:
• Algorithmic trading
• Sensor data monitoring
• Continuous analytics
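The paradigm can be sketched in plain Java (illustrative only; Storm's real API comes later): tuples are consumed one at a time from a potentially unbounded source, with no batch boundary.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StreamSketch {
    // Process tuples one at a time as they arrive. The source could be
    // unbounded, so this demo also accepts a limit.
    static List<String> process(Iterator<String> tuples, int limit) {
        List<String> out = new ArrayList<>();
        while (tuples.hasNext() && out.size() < limit) {
            out.add(tuples.next().toUpperCase()); // per-tuple work
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process(List.of("tick", "tock").iterator(), 10)); // [TICK, TOCK]
    }
}
```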

Page 13: Example: Stream of tweets

https://github.com/colinsurprenant/tweitgeist

Page 14: Agenda

• Why Twitter Storm?

• What is Twitter Storm?

• What to do with Twitter Storm?

Page 15: Welcome, Twitter Storm!

• Created by Nathan Marz @ BackType
  – Analyze tweets, links, users on Twitter

• Open sourced on 19th September, 2011
  – Eclipse Public License 1.0
  – Storm v0.5.2

• Latest updates
  – Current stable release v0.8.2, released on 11th January, 2013
  – Major core improvements planned for v0.9.0
  – Storm will be an Apache project [soon…]

Page 16: Storm under the hood

• Java & Clojure

• Apache Thrift – cross-language bridge, RPC, framework to build services

• ZeroMQ – asynchronous message transport layer

• Kryo – serialization framework

• Jetty – embedded web server

Page 17: Conceptual view

[Diagram: spouts feeding streams of tuples into a network of bolts]

Spout: source of streams

Bolt: consumer of streams; processes tuples and possibly emits new tuples

Tuple: list of name-value pairs

Stream: unbounded sequence of tuples

Topology: network of spouts and bolts as the nodes, with streams as the edges

Page 18: Physical view

[Diagram: Nimbus and a ZooKeeper ensemble coordinating Supervisors; each worker node runs Supervisor-managed worker processes containing executors and tasks]

Nimbus: master daemon process; responsible for distributing code, assigning tasks, and monitoring failures

ZooKeeper: stores the operational cluster state

Supervisor: worker daemon process listening for work assigned to its node

Worker process: Java process executing a subset of a topology

Executor: Java thread spawned by a worker process; runs one or more tasks of the same component

Task: component (spout/bolt) instance; performs the actual data processing

Page 19: A simple example: WordCount

shakespeare.txt → FileReaderSpout → (line) → WordSplitBolt → (word) → WordCountBolt → sorted list of counts:

of: 18126
to: 18763
i: 19540
and: 26099
the: 27730
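Functionally, the topology computes a word count; the split-and-count logic can be checked in plain Java without Storm (a sketch of the data flow, not of the distributed execution):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountFlow {
    // WordSplitBolt logic (split, trim, lowercase) followed by
    // WordCountBolt logic (increment a per-word counter)
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counters = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                word = word.trim().toLowerCase();
                if (!word.isEmpty()) {
                    counters.merge(word, 1, Integer::sum);
                }
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(count("To be or not to be").get("to")); // 2
    }
}
```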

Page 20: FileReaderSpout I

package de.codecentric.storm.wordcount.spouts;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class FileReaderSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private FileReader fileReader;
    private boolean completed = false;

    public void ack(Object msgId) {
        System.out.println("OK: " + msgId);
    }

    public void fail(Object msgId) {
        System.out.println("FAIL: " + msgId);
    }

Page 21: FileReaderSpout II

    /**
     * Declare the output field "line".
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    /**
     * Open the words file and keep a reference to the collector.
     */
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        try {
            this.fileReader = new FileReader(conf.get("wordsFile").toString());
        } catch (FileNotFoundException e) {
            throw new RuntimeException("Error reading file ["
                    + conf.get("wordsFile") + "]");
        }
        this.collector = collector;
    }

    public void close() {
    }

Page 22: FileReaderSpout III

    /**
     * Emit each line of the file as a tuple.
     */
    public void nextTuple() {
        // nextTuple() is called in a loop forever, so once the file
        // has been read there is nothing left to emit.
        if (completed) {
            return;
        }
        String str;
        // Open the reader
        BufferedReader reader = new BufferedReader(fileReader);
        try {
            // Read all lines and emit each one as a tuple,
            // using the line itself as the message id
            while ((str = reader.readLine()) != null) {
                this.collector.emit(new Values(str), str);
            }
        } catch (Exception e) {
            throw new RuntimeException("Error reading tuple", e);
        } finally {
            completed = true;
        }
    }
}

Page 23: WordSplitBolt I

package de.codecentric.storm.wordcount.bolts;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordSplitBolt extends BaseBasicBolt {

    public void cleanup() {}

    /**
     * The bolt emits only the field "word".
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

Page 24: WordSplitBolt II

    /**
     * Receive a line from the words file and split it into words.
     */
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getString(0);
        String[] words = sentence.split(" ");
        for (String word : words) {
            word = word.trim();
            if (!word.isEmpty()) {
                word = word.toLowerCase();
                collector.emit(new Values(word));
            }
        }
    }
}

Page 25: WordCountBolt I

package de.codecentric.storm.wordcount.bolts;

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class WordCountBolt extends BaseBasicBolt {

    private static final long serialVersionUID = 1L;

    Integer id;
    String name;
    Map<String, Integer> counters;

Page 26: WordCountBolt II

    /**
     * On create.
     */
    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        this.counters = new HashMap<String, Integer>();
        this.name = context.getThisComponentId();
        this.id = context.getThisTaskId();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String str = input.getString(0);
        // If the word is not yet in the map, create an entry;
        // otherwise increment its count
        if (!counters.containsKey(str)) {
            counters.put(str, 1);
        } else {
            Integer c = counters.get(str) + 1;
            counters.put(str, c);
        }
    }

Page 27: WordCountBolt III

    /**
     * Called when the cluster is shut down; prints the word counters.
     */
    @Override
    public void cleanup() {
        // Sort the map by count
        SortedSet<Map.Entry<String, Integer>> sortedCounts = entriesSortedByValues(counters);
        System.out.println("-- Word Counter [" + name + "-" + id + "] --");
        for (Map.Entry<String, Integer> entry : sortedCounts) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
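The entriesSortedByValues helper used in cleanup() is not shown in the deck; one possible implementation (an assumption, based on the common TreeSet-with-comparator idiom) is:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class SortHelper {
    // Sort map entries ascending by value. Ties must not compare as equal,
    // or the TreeSet would silently drop one of the entries.
    static <K, V extends Comparable<? super V>> SortedSet<Map.Entry<K, V>>
            entriesSortedByValues(Map<K, V> map) {
        SortedSet<Map.Entry<K, V>> sorted = new TreeSet<>(
                new Comparator<Map.Entry<K, V>>() {
                    public int compare(Map.Entry<K, V> e1, Map.Entry<K, V> e2) {
                        int byValue = e1.getValue().compareTo(e2.getValue());
                        return byValue != 0 ? byValue : 1; // keep ties
                    }
                });
        sorted.addAll(map.entrySet());
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 27730);
        counts.put("of", 18126);
        counts.put("and", 26099);
        System.out.println(entriesSortedByValues(counts).first().getKey()); // of
    }
}
```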

Page 28: WordCountTopology

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    public static void main(String[] args) throws InterruptedException {
        // Topology definition
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-reader", new FileReaderSpout());
        builder.setBolt("word-normalizer", new WordSplitBolt())
               .shuffleGrouping("word-reader");
        builder.setBolt("word-counter", new WordCountBolt(), 1)
               .fieldsGrouping("word-normalizer", new Fields("word"));

        // Configuration
        Config conf = new Config();
        conf.put("wordsFile", args[0]);
        conf.setDebug(false);
        conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);

        // Run the topology in a local cluster
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count-topology", conf, builder.createTopology());

        // You wouldn't do this in a regular topology
        Utils.sleep(10000);
        cluster.killTopology("word-count-topology");
        cluster.shutdown();
    }
}

Page 29: Stream Grouping

• Each spout or bolt may be running n instances in parallel.

• Groupings are used to decide which task in the subscribing bolt (group) a tuple is sent to.

• Possible groupings:
  – Shuffle: random grouping
  – Fields: grouped by value, so that equal values go to the same task
  – All: replicates to all tasks
  – Global: makes all tuples go to a single task
  – None: makes the bolt run in the same thread as the bolt/spout it subscribes to
  – Direct: the producer (the task that emits) controls which consumer will receive the tuple
  – Local: if the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks
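Why a fields grouping keeps equal values together can be sketched in plain Java: the grouping field's value is hashed, and the hash modulo the number of consumer tasks selects the target, so the same value always selects the same task. (This is illustrative; Storm's actual hashing scheme may differ.)

```java
public class FieldsGroupingSketch {
    // Choose the consumer task for a tuple from the grouping field's value
    static int targetTask(String fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        // Every "storm" tuple goes to the same WordCountBolt task,
        // so that task's counter for "storm" sees all occurrences.
        int first = targetTask("storm", 6);
        int second = targetTask("storm", 6);
        System.out.println(first == second); // true
    }
}
```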

Page 30: Key features of Twitter Storm

Storm is
• Fast & scalable
• Fault-tolerant
• Guarantees message processing
• Easy to set up & operate
• Free & open source

Page 31: Key features of Twitter Storm

Page 32: Extremely performant

Page 33: Parallelism

Number of worker nodes = 2
Number of worker slots per node = 4
Number of topology workers = 4

FileReaderSpout: parallelism_hint = 2; number of tasks not specified = same as parallelism hint
WordSplitBolt: parallelism_hint = 4; number of tasks = 8
WordCountBolt: parallelism_hint = 6; number of tasks not specified = 6

Number of component instances = 2 + 8 + 6 = 16
Number of executor threads = 2 + 4 + 6 = 12
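The slide's arithmetic follows from two defaults: one executor thread per unit of parallelism hint, and a task count equal to the hint unless overridden (as it is for WordSplitBolt). A quick check of the numbers:

```java
public class ParallelismMath {
    static int sum(int... xs) {
        int total = 0;
        for (int x : xs) total += x;
        return total;
    }

    public static void main(String[] args) {
        int executors = sum(2, 4, 6); // one thread per parallelism-hint unit
        int instances = sum(2, 8, 6); // tasks: 2 (default), 8 (explicit), 6 (default)
        System.out.println(instances + " instances, " + executors + " executors"); // 16 instances, 12 executors
    }
}
```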

Page 34: Message passing

[Diagram: a receive thread feeding per-executor receive queues; executors feeding an internal transfer queue; a transfer thread sending tuples to other workers]

• Inter-process communication is mediated by ZeroMQ; transfer to other workers uses Kryo serialization.

• Local (in-process) communication is mediated by the LMAX Disruptor; in-process transfer involves no serialization.

Page 35: Key features of Twitter Storm

Page 36: Fault tolerance

Scenario: the cluster works normally.

• Nimbus monitors cluster state through ZooKeeper.
• Supervisors synchronize assignments and send heartbeats through ZooKeeper.
• Workers send executor heartbeats through ZooKeeper; Supervisors read worker heartbeats from the local filesystem.

Page 37: Fault tolerance

Scenario: Nimbus goes down.

Processing still continues, but topology lifecycle operations and the reassignment facility are lost.

Page 38: Fault tolerance

Scenario: a worker node goes down.

Nimbus reassigns the tasks to other machines and processing continues.

Page 39: Fault tolerance

Scenario: a Supervisor goes down.

Processing still continues, but assignments are no longer synchronized.

Page 40: Fault tolerance

Scenario: a worker process goes down.

The Supervisor restarts the worker process and processing continues.

Page 41: Key features of Twitter Storm

Page 42: Reliability API

Emitting a tuple with a message ID:

public class FileReaderSpout extends BaseRichSpout {

    public void nextTuple() {
        …
        UUID messageId = getMsgId();
        // Emitting the tuple with a message ID
        collector.emit(new Values(line), messageId);
    }

    public void ack(Object msgId) {
        // Do something with the acked message id
    }

    public void fail(Object msgId) {
        // Do something with the failed message id
    }
}

Anchoring the incoming tuple to outgoing tuples and sending the ack (explicit anchoring and acking require the rich bolt API; BaseBasicBolt does both automatically):

public class WordSplitBolt extends BaseRichBolt {

    public void execute(Tuple input) {
        for (String s : input.getString(0).split("\\s")) {
            // Anchor each outgoing tuple to the incoming tuple (tuple tree)
            collector.emit(input, new Values(s));
        }
        // Ack the incoming tuple
        collector.ack(input);
    }
}

[Diagram: tuple tree — the line "This is a line" anchored to the emitted word tuples]

Page 43: ACKing Framework

[Diagram: ACKer init / ack / fail messages flowing between FileReaderSpout, WordSplitBolt, WordCountBolt, and the implicit ACKer bolt]

For every spout tuple, the ACKer bolt tracks the spout tuple ID, the spout task ID, and a 64-bit "ack val":

• Emitted tuple A: XOR tuple A's id with the ack val
• Emitted tuple B: XOR tuple B's id with the ack val
• Emitted tuple C: XOR tuple C's id with the ack val
• Acked tuple A: XOR tuple A's id with the ack val
• Acked tuple B: XOR tuple B's id with the ack val
• Acked tuple C: XOR tuple C's id with the ack val

When the ack val becomes 0, the implicit ACKer bolt knows that the tuple tree has been completed.
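The ack val works because XOR is its own inverse: XORing each tuple id in once when it is emitted and once when it is acked returns the value to zero exactly when every emit has been matched by an ack, regardless of order. A minimal sketch:

```java
public class AckVal {
    // XOR a sequence of events (emits or acks) into the 64-bit ack val
    static long xorAll(long ackVal, long... tupleIds) {
        for (long id : tupleIds) {
            ackVal ^= id;
        }
        return ackVal;
    }

    public static void main(String[] args) {
        long a = 0xA1L, b = 0xB2L, c = 0xC3L; // random 64-bit ids in real Storm
        long ackVal = xorAll(0L, a, b, c);    // tuples A, B, C emitted
        ackVal = xorAll(ackVal, c, a, b);     // tuples acked, in any order
        System.out.println(ackVal); // 0 -> tuple tree completed
    }
}
```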

Page 44: Key features of Twitter Storm

Page 45: Cluster Setup

• Set up a ZooKeeper cluster

• Install dependencies on the Nimbus and worker machines
  – ZeroMQ 2.1.7 and JZMQ
  – Java 6 and Python 2.6.6
  – unzip

• Download and extract a Storm release to the Nimbus and worker machines

• Fill the mandatory configuration into storm.yaml

• Launch the daemons under supervision using the storm scripts

• Start a topology:
  – storm jar <path_topology_jar> <main_class> <arg1> … <argN>
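The mandatory storm.yaml entries for a small cluster of this era might look like the following (host names, paths, and ports are placeholders):

```yaml
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
nimbus.host: "nimbus.example.com"
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```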

Page 46: Cluster Summary

Page 47: Topology Summary

Page 48: Component Summary

Page 49: Key features of Twitter Storm

Page 50: Basic resources

• Storm is available under the Eclipse Public License 1.0 at
  – http://storm-project.net/
  – https://github.com/nathanmarz/storm

• Get help on
  – http://groups.google.com/group/storm-user
  – the #storm-user freenode room

• Follow @stormprocessor and @nathanmarz

Page 51: Many contributions

• Community repository of modules for using Storm at
  – https://github.com/nathanmarz/storm-contrib
  – including integrations with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS, …

• Good articles for understanding Storm internals
  – http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  – http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/

• Good slides for understanding real-life examples
  – http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-youcan-get-in-30-minutes
  – http://www.slideshare.net/KrishnaGade2/storm-at-twitter

Page 52: Coming next…

• Current release: 0.8.2

• Work in progress (newest): 0.9.0-wip21
  – SLF4J and Logback
  – Pluggable tuple serialization and Blowfish encryption
  – Pluggable inter-process messaging and a Netty implementation
  – Some bug fixes
  – And more

• Storm on YARN

Page 53: Agenda

• Why Twitter Storm?

• What is Twitter Storm?

• What to do with Twitter Storm?

Page 54: One example: Webshop

• Web-tracking component

• No defined page impression

• Page impressions are identified from Varnish logs of the click-stream data

• A page consists of different fragments
  – Body
  – Article description
  – Recommendation box, …

• Session data is also of interest

Page 55: One example: Webshop

• Custom solution using J2EE and MongoDB

• Export into comScore DAx and the enterprise DWH

• The solution currently works, but does not scale

• What about performance?

Page 56: Topology Architecture