Slides from my Hadoop Summit presentation on scaling Apache Storm. Presented at Hadoop Summit 2014, San Jose.
Scaling Apache Storm P. Taylor Goetz, Hortonworks @ptgoetz
About Me Member of Technical Staff / Storm Tech Lead @
Hortonworks Storm Committer / PPMC Member / Release Mgr. @ Apache
Volunteer Firefighter since 2004
1M+ messages / sec. on a 10-15 node cluster How do you get
there?
How do you fight fire?
Put the wet stuff on the red stuff. Water, and lots of it.
When you're dealing with big fire, you need big water.
Water Sources Lakes Streams Reservoirs, Pools, Ponds
Data Hydrant You heard it here first.
How does this relate to Storm?
Little's Law: L = λW
The long-term average number of customers in a stable system, L, is
equal to the long-term average effective arrival rate, λ, multiplied
by the average time a customer spends in the system, W; or expressed
algebraically: L = λW.
http://en.wikipedia.org/wiki/Little's_law
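As a rough worked example, Little's Law can be applied to a topology by treating tuples as the customers: λ is the tuple arrival rate and W is the average time a tuple spends in the topology. The 1M messages/sec rate is the figure from the opening slide; the 5 ms latency and the class and method names are assumptions made up for illustration.

```java
// Sketch: Little's Law (L = λW) applied to tuples in a topology.
final class LittlesLaw {
    /** Average number of tuples in flight for an arrival rate (tuples/sec) and time in system (sec). */
    static double inFlight(double arrivalRatePerSec, double secondsInSystem) {
        return arrivalRatePerSec * secondsInSystem;
    }

    public static void main(String[] args) {
        double lambda = 1_000_000;  // tuples/sec, from the opening slide
        double w = 0.005;           // 5 ms per tuple: an assumed value
        System.out.printf("Tuples in flight: %.0f%n", inFlight(lambda, w));
    }
}
```

With these numbers about 5,000 tuples are in flight on average. If the time in system doubles while the arrival rate holds, the in-flight count doubles too, and that is where buffers start to grow.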
Batch vs. Streaming
Batch Processing Typically operates on data at rest Velocity is
a function of performance Poor performance costs you time
Stream Processing At the mercy of your data source Velocity
fluctuates over time Poor performance...
Poor performance bursts the pipes. Buffers fill up and eat
memory Timeouts / Replays Sink systems overwhelmed
What can developers do?
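Keep tuple processing code tight

    // Package names as of Storm 0.9.x (backtype.storm.*)
    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class MyBolt extends BaseRichBolt {
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            // initialize task
        }

        public void execute(Tuple input) {
            // process input QUICKLY!
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declare output
        }
    }

Worry about this! (execute() runs for every tuple.)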
Keep tuple processing code tight Not this: prepare() and
declareOutputFields() in the same MyBolt skeleton. They are not
called per tuple.
Know your latencies

Operation                              Time (ns)   Time (ms) Notes
L1 cache reference                           0.5
Branch mispredict                              5
L2 cache reference                             7             14x L1 cache
Mutex lock/unlock                             25
Main memory reference                        100             20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy               3,000
Send 1K bytes over 1 Gbps network         10,000   0.01
Read 4K randomly from SSD*               150,000   0.15
Read 1 MB sequentially from memory       250,000   0.25
Round trip within same datacenter        500,000   0.5
Read 1 MB sequentially from SSD*       1,000,000   1         4x memory
Disk seek                             10,000,000   10        20x datacenter round trip
Read 1 MB sequentially from disk      20,000,000   20        80x memory, 20x SSD
Send packet CA->Netherlands->CA      150,000,000   150

https://gist.github.com/jboner/2841832
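One way to put the latency numbers to work is to turn a target throughput into a per-tuple time budget and check operations against it. The sketch below assumes tuples are spread evenly across executors and each tuple is handled once, which is a simplification. The 1M tuples/sec rate and the 150 executors are taken from elsewhere in this deck, and the class and method names are hypothetical.

```java
// Sketch: how much time each executor can spend per tuple at a target rate.
final class TupleBudget {
    /** Nanoseconds one executor may spend per tuple if work is spread evenly. */
    static long budgetNanos(long tuplesPerSec, int executors) {
        if (tuplesPerSec <= 0 || executors <= 0) {
            throw new IllegalArgumentException("rate and executors must be positive");
        }
        return executors * 1_000_000_000L / tuplesPerSec;
    }

    /** True if a single operation of the given cost fits inside the budget. */
    static boolean fits(long operationNanos, long budgetNanos) {
        return operationNanos <= budgetNanos;
    }

    public static void main(String[] args) {
        long budget = budgetNanos(1_000_000L, 150);
        System.out.println("Per-tuple budget (ns): " + budget);
        System.out.println("Main memory reference fits: " + fits(100, budget));
        System.out.println("Datacenter round trip fits: " + fits(500_000, budget));
    }
}
```

Under these assumptions the budget is 150,000 ns per tuple. A main memory reference fits easily, while one synchronous round trip within the datacenter (500,000 ns) does not.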
Use a Cache Guava is your friend.
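The idea behind caching inside a bolt can be sketched without any dependencies. The slide recommends Guava; the example below is not Guava, but a small stand-in built on java.util.LinkedHashMap so it runs with the JDK alone. LruCache and its loader function are hypothetical names. It is not thread-safe, so the assumption is one instance per bolt task.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: a tiny LRU cache that calls a slow loader only on a miss.
final class LruCache<K, V> {
    private final int maxEntries;
    private final Function<K, V> loader;
    private final Map<K, V> map;

    LruCache(int maxEntries, Function<K, V> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > LruCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value, loading and storing it on a miss. */
    V get(K key) {
        V value = map.get(key);
        if (value == null) {
            value = loader.apply(key);
            map.put(key, value);
        }
        return value;
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2, key -> {
            System.out.println("slow lookup for " + key);
            return key.length();
        });
        cache.get("storm");
        cache.get("storm"); // served from the cache, no second lookup
    }
}
```

The point of the sketch is that the slow lookup (a database or remote call, say) happens once per key rather than once per tuple. Guava's cache offers the same pattern with more options, such as expiry.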
DevOps will appreciate it. Expose your knobs and gauges.
What can DevOps do?
How big is your hose?
Find out!
Performance testing is essential!
How to deal with small pipes? (i.e. When your output is more
like a garden hose.)
Parallelize Slow sinks
Parallelism == Manifold Take input from one big pipe and
distribute it to many smaller pipes The bigger the size difference,
the more parallelism you will need
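The manifold idea can be put into numbers: divide the input rate by what one writer to the slow sink can sustain, and round up. This is a minimal sketch that assumes the sink scales linearly with the number of writers, which real sinks often do not. The 20,000 writes/sec figure and the names are made up for illustration.

```java
// Sketch: how many parallel writers a slow sink needs to keep up with the input.
final class SinkParallelism {
    /** Minimum number of writers so that combined sink throughput covers the input rate. */
    static int writersNeeded(long inputPerSec, long perWriterPerSec) {
        if (perWriterPerSec <= 0) {
            throw new IllegalArgumentException("per-writer rate must be positive");
        }
        // Ceiling division: a partial writer still counts as a whole one.
        return (int) ((inputPerSec + perWriterPerSec - 1) / perWriterPerSec);
    }

    public static void main(String[] args) {
        long input = 1_000_000L;    // messages/sec coming off the big pipe
        long perWriter = 20_000L;   // assumed writes/sec one sink writer sustains
        System.out.println("Writers needed: " + writersNeeded(input, perWriter));
    }
}
```

With these assumed rates the answer is 50 writers. The bigger the gap between the two rates, the larger that number gets.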
Sizeup Initial assessment
Every fire is different.
Every Storm use case is different.
Sizeup Fire What are my water sources? What GPM can they
support? How many lines (hoses) will I need? How much water will I
need to flow to put this fire out?
Sizeup Storm What are my input sources? At what rate do they
deliver messages? What size are the messages? What's my slowest
data sink?
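The first three sizeup questions combine into a single number: message rate times message size gives the input volume. A rough sketch follows; the 1M messages/sec rate is from the opening slide, while the 1,000-byte message size and the names are assumptions.

```java
// Sketch: input volume from message rate and average message size.
final class InputBandwidth {
    /** Input volume in gigabits per second. */
    static double gigabitsPerSec(long messagesPerSec, int bytesPerMessage) {
        return messagesPerSec * (double) bytesPerMessage * 8 / 1e9;
    }

    public static void main(String[] args) {
        long rate = 1_000_000L;  // messages/sec
        int size = 1_000;        // bytes per message: an assumed average
        System.out.printf("Input volume: %.1f Gbps%n", gigabitsPerSec(rate, size));
    }
}
```

Under these assumptions the input is about 8 Gbps, well beyond a single 1 Gbps link, so message size matters as much as message count.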
There is no magic bullet.
But there are good starting points.
Numbers Where to start.
1 Worker / Machine / Topology Keep unnecessary network transfer
to a minimum
1 Acker / Worker Default in Storm 0.9.x
1 Executor / CPU Core Optimize Thread/CPU usage
1 Executor / CPU Core (for CPU-bound use cases)
1 Executor / CPU Core Multiply by 10x-100x for I/O bound use
cases
Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160
Parallelism Units available Subtract # Ackers: 160 - 10 = 150
Units.
Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150
Parallelism Units available (* 10-100 if I/O bound) Distribute this
among tasks in topology. Higher for slow tasks, lower for fast
tasks.
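The arithmetic in the example can be sketched as code, along with one possible way to distribute the units. The weighting scheme below (integer weights, higher for slower components) is an assumption for illustration, not something prescribed by the slides, and the names are hypothetical.

```java
import java.util.Arrays;

// Sketch: parallelism units for a cluster and a weighted split across components.
final class ParallelismBudget {
    /** (nodes * cores) minus ackers, as in the example: (10 * 16) - 10 = 150. */
    static int cpuBoundUnits(int nodes, int coresPerNode, int ackers) {
        return nodes * coresPerNode - ackers;
    }

    /** Splits units in proportion to weights; a higher weight means a slower component. */
    static int[] distribute(int units, int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        int[] result = new int[weights.length];
        for (int i = 0; i < weights.length; i++) {
            // Integer division rounds down; every component gets at least one unit,
            // so very small budgets can be slightly overshot.
            result[i] = Math.max(1, units * weights[i] / total);
        }
        return result;
    }

    public static void main(String[] args) {
        int units = cpuBoundUnits(10, 16, 10);
        System.out.println("CPU-bound units: " + units);
        System.out.println("I/O-bound range: " + (units * 10) + " to " + (units * 100));
        // Assumed weights: a fast parser, a medium enricher, a slow sink writer.
        System.out.println("Split: " + Arrays.toString(distribute(units, new int[] {1, 3, 6})));
    }
}
```

In a topology, numbers like these would typically end up as the parallelism hints passed to TopologyBuilder.setBolt. They remain a starting point to be adjusted by testing.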
This is just a starting point. Test, test, test. Measure,
measure, measure.
Internal Messaging Handling backpressure.
Internal Messaging (Intra-worker)
Turn knobs slowly, one at a time.
Don't mess with settings you don't understand.
Storm ships with sane defaults Override only as necessary
Hardware Considerations
Minimum Hardware Requirements
CPU Cores More is usually better The more you have the more
threads you can support (i.e. parallelism) Storm potentially uses a
LOT of threads
Memory Highly use-case specific How many workers (JVMs) per
node? Are you caching and/or holding in-memory state? Tests/metrics
are your friends
Network Use bonded NICs if necessary Keep nodes close
Other performance considerations
Don't Pancake! Separate concerns.
Keep this guy happy. He has big boots and a shovel. He will
hurt you if you piss him off.