Scaling Apache Storm - Hadoop Summit 2014

Scaling Apache Storm P. Taylor Goetz, Hortonworks @ptgoetz

About Me Member of Technical Staff / Storm Tech Lead @ Hortonworks Storm Committer / PPMC Member / Release Mgr. @ Apache Volunteer Firefighter since 2004

1M+ messages / sec. on a 10-15 node cluster How do you get there?

How do you fight fire?

Put the wet stuff on the red stuff. Water, and lots of it.

When you're dealing with big fire, you need big water.

Water Sources Lakes Streams Reservoirs, Pools, Ponds

Data Hydrant You heard it here first.

How does this relate to Storm?

Little's Law L = λW The long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: L = λW. http://en.wikipedia.org/wiki/Little's_law
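Applied to a topology, Little's Law gives a rough estimate of how many tuples are in flight at once. A minimal sketch, assuming the 1M messages/sec figure quoted earlier and a purely hypothetical average of 10 ms spent per tuple; the class and method names are illustrative and not part of Storm:

```java
// Little's Law: L = lambda * W
final class LittlesLaw {
    /** Average number of items in the system, given arrival rate (items/sec) and time in system (sec). */
    static double inFlight(double arrivalRatePerSec, double avgSecondsInSystem) {
        return arrivalRatePerSec * avgSecondsInSystem;
    }

    public static void main(String[] args) {
        double lambda = 1_000_000.0; // messages/sec (figure quoted in the talk)
        double w = 0.010;            // seconds per tuple (assumed for illustration)
        System.out.printf("Tuples in flight: %.0f%n", inFlight(lambda, w));
    }
}
```

At a fixed arrival rate, doubling the time each tuple spends in the system doubles the number of tuples the system has to hold.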

Batch vs. Streaming

Batch Processing Typically operates on data at rest Velocity is a function of performance Poor performance costs you time

Stream Processing At the mercy of your data source Velocity fluctuates over time Poor performance.

Poor performance bursts the pipes. Buffers fill up and eat memory Timeouts / Replays Sink systems overwhelmed
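How quickly buffers fill can be estimated from the gap between arrival rate and processing rate. A rough sketch with made-up rates (100,000 tuples/sec in, 90,000 tuples/sec processed); the names are hypothetical, and real buffers are bounded, so in practice this shows up as timeouts and replays rather than unlimited growth:

```java
final class Backlog {
    /** Tuples queued after the given number of seconds when arrivals outpace processing; never negative. */
    static long queuedAfter(long arrivalsPerSec, long processedPerSec, long seconds) {
        long surplusPerSec = Math.max(0L, arrivalsPerSec - processedPerSec);
        return surplusPerSec * seconds;
    }

    public static void main(String[] args) {
        // Assumed rates: a 10% shortfall in processing capacity, sustained for one minute.
        System.out.println(queuedAfter(100_000L, 90_000L, 60L) + " tuples waiting");
    }
}
```

With these assumed numbers, one minute of a 10% shortfall leaves 600,000 tuples waiting; at a hypothetical 1 KB per tuple that is roughly 600 MB of buffered data.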

What can developers do?

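    // Imports use the Storm 0.9.x package names (backtype.storm.*)
    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class MyBolt extends BaseRichBolt {
        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            // initialize task
        }

        @Override
        public void execute(Tuple input) {
            // process input QUICKLY!
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declare output
        }
    }

Keep tuple processing code tight
Worry about this!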

    public class MyBolt extends BaseRichBolt {
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            // initialize task
        }

        public void execute(Tuple input) {
            // process input QUICKLY!
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declare output
        }
    }

Keep tuple processing code tight
Not this.

Know your latencies

Operation                                    ns      ms   Notes
L1 cache reference                          0.5
Branch mispredict                             5
L2 cache reference                            7           14x L1 cache
Mutex lock/unlock                            25
Main memory reference                       100           20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy              3,000
Send 1K bytes over 1 Gbps network        10,000    0.01
Read 4K randomly from SSD*              150,000    0.15
Read 1 MB sequentially from memory      250,000    0.25
Round trip within same datacenter       500,000    0.5
Read 1 MB sequentially from SSD*      1,000,000    1      4X memory
Disk seek                            10,000,000   10      20x datacenter roundtrip
Read 1 MB sequentially from disk     20,000,000   20      80x memory, 20X SSD
Send packet CA->Netherlands->CA     150,000,000  150

https://gist.github.com/jboner/2841832

Use a Cache Guava is your friend.
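The idea behind caching in a bolt is to avoid repeating a slow lookup (a network round trip, a disk read) for every tuple. The sketch below is not Guava: it is a JDK-only stand-in built on LinkedHashMap that shows the same pattern of "look up, load on miss, evict the least recently used entry". The class name LruCache and its methods are hypothetical, and it is not thread-safe; it assumes one cache per bolt instance, used from a single thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Small LRU cache: evicts the least recently accessed entry once maxEntries is exceeded. */
final class LruCache<K, V> {
    private final Map<K, V> entries;

    LruCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently accessed
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, calling the (slow) loader only on a miss. */
    V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    int entryCount() {
        return entries.size();
    }
}
```

In a bolt, a cache like this would typically be created in prepare() and consulted in execute(), so that only cache misses pay the full lookup latency. Guava's cache adds features this sketch lacks, such as time-based expiry and thread safety.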

DevOps will appreciate it. Expose your knobs and gauges.

What can DevOps do?

How big is your hose?

Find out!

Performance testing is essential!

How to deal with small pipes? (i.e. When your output is more like a garden hose.)

Parallelize Slow sinks

Parallelism == Manifold Take input from one big pipe and distribute it to many smaller pipes The bigger the size difference, the more parallelism you will need
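The manifold idea can be turned into a first estimate of how many parallel writers a slow sink needs. A minimal sketch with hypothetical names and rates; it assumes the sink's throughput scales linearly with the number of writers, which should be verified by testing:

```java
final class SinkParallelism {
    /** Minimum number of parallel writers whose combined throughput keeps up with the input rate. */
    static int writersNeeded(long inputPerSec, long perWriterPerSec) {
        if (perWriterPerSec <= 0) {
            throw new IllegalArgumentException("perWriterPerSec must be positive");
        }
        // Ceiling division: a fractional writer still requires a whole one.
        return (int) ((inputPerSec + perWriterPerSec - 1) / perWriterPerSec);
    }

    public static void main(String[] args) {
        // Assumed figures: 1M messages/sec arriving, each writer sustaining 15,000 writes/sec.
        System.out.println(writersNeeded(1_000_000L, 15_000L) + " writers");
    }
}
```

With the assumed figures this gives 67 writers; the larger the gap between the input rate and what a single writer can sustain, the higher the number.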

Sizeup Initial assessment

Every fire is different.

Every Storm use case is different.

Sizeup Fire What are my water sources? What GPM can they support? How many lines (hoses) will I need? How much water will I need to flow to put this fire out?

Sizeup Storm What are my input sources? At what rate do they deliver messages? What size are the messages? What's my slowest data sink?
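Message rate and message size together give the input bandwidth, which can be compared against the network and the slowest sink. A small sketch, assuming a hypothetical 200-byte message at the 1M messages/sec rate quoted earlier; the names are illustrative:

```java
final class Sizeup {
    /** Input bandwidth in megabits per second for a given message rate and message size. */
    static double megabitsPerSec(long messagesPerSec, int bytesPerMessage) {
        return messagesPerSec * (double) bytesPerMessage * 8 / 1_000_000.0;
    }

    public static void main(String[] args) {
        // 200 bytes per message is an assumption for illustration only.
        System.out.printf("%.0f Mbps%n", megabitsPerSec(1_000_000L, 200));
    }
}
```

Under these assumptions the input alone is 1,600 Mbps, more than a single 1 Gbps link can carry.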

There is no magic bullet.

But there are good starting points.

Numbers Where to start.

1 Worker / Machine / Topology Keep unnecessary network transfer to a minimum

1 Acker / Worker Default in Storm 0.9.x

1 Executor / CPU Core Optimize Thread/CPU usage

1 Executor / CPU Core (for CPU-bound use cases)

1 Executor / CPU Core Multiply by 10x-100x for I/O bound use cases

Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 Parallelism Units available

Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 Parallelism Units available Subtract # Ackers: 160 - 10 = 150 Units.

Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 Parallelism Units available

Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 Parallelism Units available (* 10-100 if I/O bound) Distribute this among tasks in topology. Higher for slow tasks, lower for fast tasks.
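The arithmetic in this example can be sketched in code. The weighting scheme below is an assumption for illustration, not something Storm provides: each component gets a share of the budget in proportion to a relative weight, with slower components given higher weights. The component names and weights are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ParallelismBudget {
    /** One unit per core across the cluster, minus the executors used by ackers. */
    static int units(int nodes, int coresPerNode, int ackers) {
        return nodes * coresPerNode - ackers;
    }

    /**
     * Splits the budget among components in proportion to their weights.
     * Shares are rounded down with a floor of 1, so the total can differ slightly from the budget.
     */
    static Map<String, Integer> distribute(int units, Map<String, Integer> weights) {
        int totalWeight = 0;
        for (int w : weights.values()) {
            totalWeight += w;
        }
        if (totalWeight <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive number");
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            result.put(e.getKey(), Math.max(1, units * e.getValue() / totalWeight));
        }
        return result;
    }

    public static void main(String[] args) {
        int budget = units(10, 16, 10); // (10 * 16) - 10 = 150
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("spout", 1); // fast
        weights.put("parse", 2);
        weights.put("write", 7); // slow sink
        System.out.println(distribute(budget, weights));
    }
}
```

With these weights the 150 units split into 15, 30 and 105. The resulting numbers would be passed as the parallelism hints when the topology is built, and then adjusted based on measurements.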

This is just a starting point. Test, test, test. Measure, measure, measure.

Internal Messaging Handling backpressure.

Internal Messaging (Intra-worker)

Turn knobs slowly, one at a time.

Don't mess with settings you don't understand.

Storm ships with sane defaults Override only as necessary

Hardware Considerations

Minimum Hardware Requirements

CPU Cores More is usually better The more you have the more threads you can support (i.e. parallelism) Storm potentially uses a LOT of threads

Memory Highly use-case specific How many workers (JVMs) per node? Are you caching and/or holding in-memory state? Tests/metrics are your friends

Network Use bonded NICs if necessary Keep nodes close

Other performance considerations

Don't Pancake! Separate concerns.

Keep this guy happy. He has big boots and a shovel. He will hurt you if you piss him off.

Shameless Plug http://www.packtpub.com/storm-distributed-real-time-computation-blueprints/book

Thanks! Questions? Storm BoF Session 3:30 Room 230A
