Extending the Yahoo! Streaming Benchmark + MapR Benchmarks


Extending the Yahoo! Streaming Benchmark

Jamie Grier
@jamiegrier
jamie@data-artisans.com

Who am I?
• Director of Applications Engineering at data Artisans
• Previously worked on streaming computation at Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for about a decade
• High-speed video, social media streaming, general frameworks for stream processing

Overview
• Yahoo! performed a benchmark comparing Apache Flink, Storm and Spark
• The benchmark never actually pushed Flink to its throughput limits but stopped at Storm's limits
• I knew Flink was capable of much more, so I repeated the benchmarks myself
• I wrote a follow-up blog post explaining my findings and will summarize them here

Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit the current value of the window aggregates to Redis every second for query
• Map ads to campaigns using Redis as well

Any questions so far?

Storm Code
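
The original slide showed the Storm topology for this job. Below is a minimal sketch of its shape under stated assumptions: the bolt classes (FilterBolt, ProjectBolt, RedisJoinBolt, WindowBolt, RedisSinkBolt) and the Kafka/ZooKeeper addresses are hypothetical placeholders, not the benchmark's actual code, and import package names differ between Storm 0.x (backtype.storm) and 1.x (org.apache.storm).

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class AdBenchmarkTopology {
  public static void main(String[] args) throws Exception {
    // Read ad events from the Kafka topic (hypothetical addresses).
    SpoutConfig spoutConfig =
        new SpoutConfig(new ZkHosts("zk1:2181"), "ad-events", "/kafka", "ad-benchmark");

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("source", new KafkaSpout(spoutConfig), 10);
    // Each stage is a separate bolt; tuples are shuffled between stages.
    builder.setBolt("filter", new FilterBolt(), 10).shuffleGrouping("source");
    builder.setBolt("project", new ProjectBolt(), 10).shuffleGrouping("filter");
    builder.setBolt("join", new RedisJoinBolt(), 10).shuffleGrouping("project");
    // Group by campaign so each window task owns a fixed set of campaigns.
    builder.setBolt("window", new WindowBolt(), 10).fieldsGrouping("join", new Fields("campaign_id"));
    builder.setBolt("sink", new RedisSinkBolt(), 10).shuffleGrouping("window");

    StormSubmitter.submitTopology("ad-benchmark", new Config(), builder.createTopology());
  }
}
```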

Flink Code
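
And a corresponding hedged sketch of the Flink DataStream job, against the 1.x-era Java API. The AdEvent type, RedisCampaignJoin map function and RedisResultSink are hypothetical stand-ins; the real benchmark additionally uses a custom trigger to emit the in-progress window value to Redis every second, which is not shown here.

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AdBenchmarkJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("zookeeper.connect", "zk1:2181");    // hypothetical addresses
    props.setProperty("bootstrap.servers", "kafka1:9092");
    props.setProperty("group.id", "ad-benchmark");

    DataStream<String> raw =
        env.addSource(new FlinkKafkaConsumer08<>("ad-events", new SimpleStringSchema(), props));

    raw.map(AdEvent::fromJson)                      // deserialize the JSON event
       .filter(e -> "view".equals(e.eventType))     // keep only ad view events
       .map(new RedisCampaignJoin())                // project + look up campaign_id in Redis -> (campaign_id, 1L)
       .keyBy(0)                                    // group by campaign
       .timeWindow(Time.seconds(10))                // 10 second windows
       .sum(1)                                      // count impressions per campaign and window
       .addSink(new RedisResultSink());             // write the window aggregates to Redis

    env.execute("yahoo-streaming-benchmark");
  }
}
```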

Hardware Specs
• 10 Kafka brokers with 2 partitions each
• 10 compute nodes (Flink / Storm)
• Each machine has 1 Xeon E3-1230-V2 @ 3.30GHz CPU
• 4 cores, 8 vCores (hyperthreading)
• 32 GB RAM (only 8 GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and compute nodes

Logical Deployment

[Diagram: Data Generator → Kafka → Stream Processor (Source → Filter → Project → Join → Window → Sink), with Redis used for the ad-to-campaign join and as the sink for the window aggregates]

Apache Storm Deployment

[Diagram: Data Generator (Flink) → Kafka brokers (1 GigE links) → Apache Storm compute nodes (10 GigE between them) running Source, Filter, Project, Join, Window and Sink as separate tasks, with a shuffle before the window and Redis serving the join and the sink]

Apache Flink Deployment

[Diagram series: the same pipeline deployed on Apache Flink. Data Generator (Flink) → Kafka (1 GigE links) → Flink compute nodes (10 GigE between them). Across the slides the operators are progressively chained until Source / Filter / Project / Join run as a single task and Window / Sink as another, connected by one shuffle; Redis again serves the join and the sink]

Processing Guarantees: Apples and Oranges

Apache Storm                       Apache Flink
At-least-once semantics            Exactly-once semantics
Double counting after failures     No double counting
Lost state after failures          No state loss
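
The exactly-once column is not free behavior; the job has to enable Flink's checkpointing. A minimal sketch of how that is switched on (the 5-second interval is an illustrative value, not the benchmark's setting):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Snapshot all operator state (including window contents) every 5 seconds.
    // In EXACTLY_ONCE mode a failed job restarts from the last snapshot,
    // so counts are neither lost nor duplicated.
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

    // ... build the same pipeline as in the Flink job above ...
    env.execute("checkpointed-ad-benchmark");
  }
}
```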

Benchmark: Baseline

[Chart: throughput in msgs/sec]
Storm (Kafka, 1 GigE):  0M
Flink (Kafka, 1 GigE):  3M

Bottleneck Analysis

[Diagram: the Apache Storm deployment as above; the bottleneck is CPU on the compute nodes]

[Diagram: the Apache Flink deployment as above; the bottleneck is the 1 GigE network between the Kafka cluster and the compute nodes]

Eliminate the Bottleneck

[Diagram series: the Kafka cluster and its 1 GigE links are removed from the deployment and replaced by a data generator running directly inside the Flink job, so the whole pipeline runs over the 10 GigE network]
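
Concretely, eliminating the bottleneck means the events are produced by a source running inside the Flink job rather than read from Kafka. A hedged sketch of such a generator source; the AdEvent type and the field choices are hypothetical, and the benchmark's real generator is more elaborate:

```java
import java.util.Random;
import java.util.UUID;

import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Each parallel source instance emits synthetic ad events as fast as the pipeline can absorb them.
public class AdEventGenerator extends RichParallelSourceFunction<AdEvent> {

  private volatile boolean running = true;

  @Override
  public void run(SourceContext<AdEvent> ctx) throws Exception {
    Random rnd = new Random();
    String[] adIds = new String[1000];
    for (int i = 0; i < adIds.length; i++) {
      adIds[i] = UUID.randomUUID().toString();       // hypothetical ad id universe
    }
    while (running) {
      AdEvent event = new AdEvent(adIds[rnd.nextInt(adIds.length)], "view", System.currentTimeMillis());
      synchronized (ctx.getCheckpointLock()) {        // keep emission consistent with checkpoints
        ctx.collect(event);
      }
    }
  }

  @Override
  public void cancel() {
    running = false;
  }
}
```

The job then starts from env.addSource(new AdEventGenerator()) instead of the Kafka consumer; the rest of the pipeline is unchanged.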

Apache Flink Deployment: Round 2

Benchmark: Round 2

[Chart: throughput in msgs/sec; the DataGen run is 10 GigE end-to-end]
Storm (Kafka, 1 GigE):     0M
Flink (Kafka, 1 GigE):     3M
Flink (DataGen, 10 GigE):  15M

Results
• Apache Flink achieved 15 million messages/sec on the Yahoo! benchmark
• Much stronger processing guarantees: exactly once
• 80x higher than what was reported in the original Yahoo! benchmark on similar hardware

Questions?

Apache Flink and MapR Streams

[Diagram: Data Generator → MapR Streams (MapR cluster) → Flink running the chained Source / Filter / Project / Join and Window / Sink tasks, with Redis for the join and the sink]
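
MapR Streams is consumed through a Kafka-compatible client API, so one plausible way to wire it into the same Flink job is Flink's Kafka consumer connector with the MapR client libraries on the classpath. This is a sketch under those assumptions; the stream path /apps/benchmark:ad-events is hypothetical and the exact connector and client version pairing depends on the MapR installation:

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class MapRStreamsJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("group.id", "ad-benchmark");

    // MapR Streams topics are addressed as "/<stream-path>:<topic>" through the Kafka API.
    DataStream<String> events = env.addSource(
        new FlinkKafkaConsumer09<>("/apps/benchmark:ad-events", new SimpleStringSchema(), props));

    // ... same Filter / Project / Join / Window / Sink pipeline as before ...
    events.print();
    env.execute("mapr-streams-benchmark");
  }
}
```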

MapR Benchmark Hardware Specs
• 10 MapR nodes, 3x data replication
• Each node has 1 Xeon E5-2660-v3 @ 2.60GHz CPU
• 10 cores, 20 vCores (hyperthreading)
• 16 vCores used for Flink on each node
• 256 GB RAM (only 8 GB allocated to Flink)
• 40 GigE Ethernet between compute nodes

Benchmarking on MapR HPC Cluster

[Chart: throughput in msgs/sec, 40 GigE end-to-end]
Flink (MapR Streams):         10M  (with 3x replication)
Flink (with Data Generator):  72M

Benchmarking Summary

[Chart: throughput in msgs/sec]
Storm (Kafka, 1 GigE):     0M
Flink (Kafka, 1 GigE):     3M
Flink (MapR, 40 GigE):     10M
Flink (DataGen, 10 GigE):  15M
Flink (DataGen, 40 GigE):  72M

What’s missing?

Flink (Kafka, 10 GigE):  ???
Flink (Kafka, 40 GigE):  ???

Results
• Apache Flink achieved 10 million messages/sec on the Yahoo! benchmark when paired with MapR Streams on a high-performance 10-node cluster
• On the same cluster hardware, Apache Flink achieved 72 million messages/sec when using direct data generation

Storm Compatibility
• Lots of companies already have applications written using the Storm API
• Flink provides a Storm compatibility layer
• Run your Storm jobs on Flink with a one-line code change (see the sketch below)
• Flink also lets you reuse your existing Storm spout and bolt code from a Flink job
• Give it a try!
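
The "one-line code change" refers to the flink-storm compatibility module: the topology is still assembled with Storm's own TopologyBuilder, and only the submission call changes. A hedged sketch; the spout and bolt classes are hypothetical, and the exact class and package names (backtype.storm vs. org.apache.storm, and the compatibility-layer classes) depend on the Flink and Storm versions in use:

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

import org.apache.flink.storm.api.FlinkLocalCluster;
import org.apache.flink.storm.api.FlinkTopology;

public class StormJobOnFlink {
  public static void main(String[] args) throws Exception {
    // Build the topology exactly as before, reusing existing spouts and bolts unchanged.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("source", new AdEventSpout(), 10);
    builder.setBolt("campaign-counter", new CampaignCountBolt(), 10).shuffleGrouping("source");

    // The one-line change: submit to Flink instead of a Storm cluster.
    FlinkLocalCluster cluster = FlinkLocalCluster.getLocalCluster();
    cluster.submitTopology("ad-benchmark", new Config(), FlinkTopology.createTopology(builder));
  }
}
```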

Thanks to MapR!
Special thanks to: Terry He, Ted Dunning
