
Apache Flink at Strata San Jose 2016


Page 1: Apache Flink at Strata San Jose 2016

Kostas Tzoumas (@kostas_tzoumas)

Apache Flink™: Counting elements in streams

Page 2: Apache Flink at Strata San Jose 2016

Introduction

2

Page 3: Apache Flink at Strata San Jose 2016

3

Data streaming is becoming increasingly popular*

*Biggest understatement of 2016

Page 4: Apache Flink at Strata San Jose 2016

4

Streaming technology is enabling the obvious: continuous processing on data

that is continuously produced

Page 5: Apache Flink at Strata San Jose 2016

5

Streaming is the next programming paradigm for data applications, and

you need to start thinking in terms of streams

Page 6: Apache Flink at Strata San Jose 2016

6

Counting

Page 7: Apache Flink at Strata San Jose 2016

Continuous counting

A seemingly simple application, but generally an unsolved problem

E.g., count visitors, impressions, interactions, clicks, etc.

Aggregations and OLAP cube operations are generalizations of counting

7

Page 8: Apache Flink at Strata San Jose 2016

Counting in batch architecture

Continuous ingestion

Periodic (e.g., hourly) files

Periodic batch jobs

8
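The batch pipeline above can be sketched in plain Java (a hypothetical "timestampMillis,country" line format, not actual Flink or Hadoop code): events accumulate in an hourly file, and a periodically scheduled job counts them per key.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the periodic batch job: read one hourly file of events
// (hypothetical "timestampMillis,country" lines) and count per country.
public class HourlyBatchCount {

    static Map<String, Long> countFile(String[] lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String country = line.split(",")[1];   // second field is the key
            counts.merge(country, 1L, Long::sum);  // increment the running count
        }
        return counts;
    }

    public static void main(String[] args) {
        // One "hourly file" worth of events; a scheduler would rerun this every hour.
        String[] hourlyFile = {"1459900000000,US", "1459900001000,DE", "1459900002000,US"};
        System.out.println(countFile(hourlyFile)); // counts per country for this hour
    }
}
```

Note that the result only exists after the hour closes and the job runs, which is exactly the latency problem discussed next.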

Page 9: Apache Flink at Strata San Jose 2016

Problems with batch architecture

High latency

Too many moving parts

Implicit treatment of time

Out-of-order event handling

Implicit batch boundaries

9

Page 10: Apache Flink at Strata San Jose 2016

Counting in λ architecture

"Batch layer": what we had before

"Stream layer": approximate early results

10

Page 11: Apache Flink at Strata San Jose 2016

Problems with batch and λ

Way too many moving parts (and code duplication)

Implicit treatment of time

Out-of-order event handling

Implicit batch boundaries

11

Page 12: Apache Flink at Strata San Jose 2016

Counting in streaming architecture

Message queue ensures stream durability and replay

Stream processor ensures consistent counting

12
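As a rough plain-Java illustration (none of this is Flink API; a simple in-memory list stands in for the durable message queue), the streaming architecture replaces periodic jobs with one long-running loop that updates counts event by event:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the streaming architecture: a long-running processor
// consumes events one at a time and keeps the counts as in-memory state.
public class StreamingCount {

    // The processor's state: running count per key, updated on every event.
    static Map<String, Long> count(Iterable<String> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (String country : events) {            // in reality: an endless loop over a queue
            counts.merge(country, 1L, Long::sum);  // counts are always up to date
        }
        return counts;
    }

    public static void main(String[] args) {
        // Events as they would arrive from a durable message queue (e.g., Kafka).
        System.out.println(count(java.util.Arrays.asList("US", "DE", "US"))); // {DE=1, US=2}
    }
}
```

Unlike the batch sketch, the counts are current after every event; durability and replay of the input are the message queue's job.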

Page 13: Apache Flink at Strata San Jose 2016

Counting in Flink DataStream API

Number of visitors in last hour by country

13

DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...))  // create stream from Kafka
    .keyBy("country")                        // group by country
    .timeWindow(Time.minutes(60))            // window of size 1 hour
    .apply(new CountPerWindowFunction());    // do operations per window

Page 14: Apache Flink at Strata San Jose 2016

Counting hierarchy of needs

14

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and repeatable,

... queryable (Flink 1.1+)

Based on Maslow's hierarchy of needs

Page 21: Apache Flink at Strata San Jose 2016

Rest of this talk

21

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and repeatable,

... queryable

Page 22: Apache Flink at Strata San Jose 2016

Latency

22

Page 23: Apache Flink at Strata San Jose 2016

23

Yahoo! Streaming Benchmark

Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!

Focus on measuring end-to-end latency at low throughputs

First benchmark that was modeled after a real application

Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Page 24: Apache Flink at Strata San Jose 2016

Benchmark task: counting!

24

Count ad impressions grouped by campaign

Compute aggregates over last 10 seconds

Make aggregates available for queries (Redis)
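The benchmark's core aggregation can be sketched in plain Java (hypothetical record layout, not the benchmark's actual code): assign each impression to a 10-second tumbling window and count per (window, campaign); the real job then pushes each window's counts to Redis for querying.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the benchmark aggregation: count ad impressions per campaign
// in 10-second tumbling windows, keyed by "windowStart/campaignId".
public class CampaignCount {
    static final long WINDOW_MILLIS = 10_000L;

    static Map<String, Long> countImpressions(String[] campaigns, long[] eventTimes) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i < campaigns.length; i++) {
            // Start of the 10-second window containing this impression.
            long windowStart = eventTimes[i] - (eventTimes[i] % WINDOW_MILLIS);
            counts.merge(windowStart + "/" + campaigns[i], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two impressions for campaign "a" in the first window, one for "b" in the next.
        Map<String, Long> counts = countImpressions(
                new String[]{"a", "a", "b"}, new long[]{1_000L, 2_000L, 15_000L});
        System.out.println(counts);
    }
}
```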

Page 25: Apache Flink at Strata San Jose 2016

Results (lower is better)

25

Flink and Storm achieve sub-second latencies

Spark has a latency-throughput tradeoff

(benchmark throughput: 170k events/sec)

Page 26: Apache Flink at Strata San Jose 2016

Efficiency, and scalability

26

Page 27: Apache Flink at Strata San Jose 2016

27

Handling high-volume streams

Scalability: how many events/sec can a system process with unlimited resources?

Scalability comes at a cost (systems add overhead to be scalable)

Efficiency: how many events/sec can a system process with limited resources?

Page 28: Apache Flink at Strata San Jose 2016

Extending the Yahoo! benchmark

28

The Yahoo! benchmark is a great starting point for understanding engine behavior.

However, the benchmark stops at a low write throughput, and its programs are not fault tolerant.

We extended the benchmark (1) to high volumes and (2) to use Flink's built-in state.
Read more: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Page 29: Apache Flink at Strata San Jose 2016

Results (higher is better)

29

Also: Flink jobs are correct under failures (exactly once), Storm jobs are not

500k events/sec

3 million events/sec

15 million events/sec

Page 30: Apache Flink at Strata San Jose 2016

Fault tolerance and repeatability

30

Page 31: Apache Flink at Strata San Jose 2016

31

Stateful streaming applications

Most interesting stream applications are stateful

How to ensure that state is correct after failures?

Page 32: Apache Flink at Strata San Jose 2016

Fault tolerance definitions

At least once
• May over-count (in some cases even under-count, under a different definition)

Exactly once
• Counts are the same after a failure

End-to-end exactly once
• Counts appear the same in an external sink (database, file system) after a failure

32
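The difference between the first two guarantees can be made concrete with a toy simulation (plain Java, not how any real engine implements recovery): a job counts five events, crashes after the third, and the source replays the input from a checkpoint taken after the second.

```java
// Toy simulation of the guarantees: count events, crash after `failAfter`
// events, then replay the input from a checkpoint taken at `checkpointAt`.
public class FaultToleranceDemo {

    // At least once: replayed events are re-counted on top of the old state.
    static long atLeastOnce(int totalEvents, int checkpointAt, int failAfter) {
        long count = 0;
        for (int i = 0; i < failAfter; i++) count++;              // first run, then crash
        for (int i = checkpointAt; i < totalEvents; i++) count++; // replay from checkpoint
        return count;
    }

    // Exactly once: state is rolled back to the checkpoint before replaying.
    static long exactlyOnce(int totalEvents, int checkpointAt, int failAfter) {
        long count = 0;
        for (int i = 0; i < failAfter; i++) count++;              // first run, then crash
        count = checkpointAt;                                      // restore checkpointed count
        for (int i = checkpointAt; i < totalEvents; i++) count++; // replay from checkpoint
        return count;
    }

    public static void main(String[] args) {
        System.out.println(atLeastOnce(5, 2, 3)); // 6: one event double-counted
        System.out.println(exactlyOnce(5, 2, 3)); // 5: correct count after failure
    }
}
```

The key point: exactly once requires the state snapshot and the input position to be consistent with each other, which is what Flink's checkpoints provide.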

Page 33: Apache Flink at Strata San Jose 2016

Fault tolerance in Flink

Flink guarantees exactly once

End-to-end exactly once supported with specific sources and sinks
• E.g., Kafka → Flink → HDFS

Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation

33

Page 34: Apache Flink at Strata San Jose 2016

Savepoints

Maintaining stateful applications in production is challenging

Flink savepoints: externally triggered, durable checkpoints

Enable easy code upgrades (Flink or application), maintenance, migration, debugging, what-if simulations, and A/B tests

34

Page 35: Apache Flink at Strata San Jose 2016

Explicit handling of time

35

Page 36: Apache Flink at Strata San Jose 2016

36

Notions of time

Star Wars releases: 1977 (Episode IV: A New Hope), 1980 (Episode V: The Empire Strikes Back), 1983 (Episode VI: Return of the Jedi), 1999 (Episode I: The Phantom Menace), 2002 (Episode II: Attack of the Clones), 2005 (Episode III: Revenge of the Sith), 2015 (Episode VII: The Force Awakens)

The episode order is called event time

The release order is called processing time
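The distinction can be made concrete with a small plain-Java sketch (illustrative only, not Flink API): the same events land in different window counts depending on whether they are bucketed by the timestamp carried inside each event or by their arrival time.

```java
import java.util.Map;
import java.util.TreeMap;

// Assign timestamps to tumbling windows of the given size and count per window.
public class EventVsProcessingTime {

    static Map<Long, Integer> tumbling(long[] times, long size) {
        Map<Long, Integer> windows = new TreeMap<>();
        for (long t : times) {
            windows.merge(t - (t % size), 1, Integer::sum); // window start -> count
        }
        return windows;
    }

    public static void main(String[] args) {
        long[] eventTimes   = {5, 25, 8};       // timestamps inside the events (out of order)
        long[] arrivalTimes = {100, 101, 102};  // when each event reached the processor

        System.out.println(tumbling(eventTimes, 10));   // {0=2, 20=1}: late event (8) lands in the right window
        System.out.println(tumbling(arrivalTimes, 10)); // {100=3}: everything lumped into the arrival window
    }
}
```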

Page 37: Apache Flink at Strata San Jose 2016

37

Out-of-order streams

[Figure: two bursts of events assigned to event-time windows, arrival-time windows, and instant event-at-a-time processing]

Page 38: Apache Flink at Strata San Jose 2016

Why event time

Most stream processors are limited to processing/arrival time; Flink can operate on event time as well

Benefits of event time
• Accurate results for out-of-order data
• Sessions and unaligned windows
• Time travel (backstreaming)

38

Page 39: Apache Flink at Strata San Jose 2016

What's coming up in Flink

39

Page 40: Apache Flink at Strata San Jose 2016

40

Evolution of streaming in Flink

Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing

Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring

Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support

Page 41: Apache Flink at Strata San Jose 2016

Upcoming features

SQL: ongoing work in collaboration with Apache Calcite

Dynamic scaling: adapt resources to stream volume, historical stream processing

Queryable state: ability to query the state inside the stream processor

Mesos support

More sources and sinks (e.g., Kinesis, Cassandra)

41

Page 42: Apache Flink at Strata San Jose 2016

Queryable state

42

Using the stream processor as a database

Page 43: Apache Flink at Strata San Jose 2016

Closing

43

Page 44: Apache Flink at Strata San Jose 2016

44

Summary

Stream processing is gaining momentum; it is the right paradigm for continuous data applications

Even seemingly simple applications can be complex at scale and in production – the choice of framework is crucial

Flink: a unique combination of capabilities, performance, and robustness

Page 45: Apache Flink at Strata San Jose 2016

Flink in the wild

45

• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki", a data integration & distribution platform

[Slide shows adopter logos and pointers to their talks]

Page 46: Apache Flink at Strata San Jose 2016

Join the community!

46

Follow: @ApacheFlink, @dataArtisans
Read: flink.apache.org/blog, data-artisans.com/blog
Subscribe: (news | user | dev) @ flink.apache.org

Page 47: Apache Flink at Strata San Jose 2016

2 meetups next week in Bay Area!

47

April 5, San Francisco & April 6, San Jose

What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/