
Introduction to Apache Beam (incubating)

Slides by Manu Zhang, July 2016

Unified Batch + strEAM processing model

Beam

credit: http://www.post-gazette.com/starwars
credit: http://reallyobsessedwithfilm.blogspot.com/2012/05/x-men-second-grade-we-need-leader.html

Beam

The Evolution of Apache Beam

[Figure: timeline of Google data-processing systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) culminating in Google Cloud Dataflow and Apache Beam]

Batch Patterns: Time Based Windows

[Figure: a MapReduce batch job processes Tuesday's data as one bounded set, shown split into hour-long time-based windows from 11:00 to 0:00]

Streaming Patterns: Event-Time Based Windows

[Figure: unbounded input arriving between 10:00 and 15:00 in processing time, grouped by event time into windowed output]

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.
Too Fast? Some data is late.
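To make that trade-off concrete, here is a minimal windowing sketch against the Beam Java SDK; the two-minute window, one-minute early-firing delay, and 30-minute lateness bound are illustrative assumptions, not values from the slides. Early firings emit speculative results while a slow watermark holds back the on-time pane, and allowed lateness plus late firings recover data that a fast watermark declared complete too soon.

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class WatermarkTradeoffs {
  // Windowing that hedges against both watermark failure modes:
  // speculative (early) panes mask a too-slow watermark, while allowed
  // lateness plus late panes recover data a too-fast watermark missed.
  static Window<KV<String, Integer>> windowing() {
    return Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(
            AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes();
  }
}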

The Beam Model: What is Being Computed?

PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

The Beam Model: Where in Event Time?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());

The Beam Model: When in Processing Time?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey());

The Beam Model: How Do Refinements Relate?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .discardingFiredPanes()) // or accumulatingFiredPanes()
    .apply(Sum.integersPerKey());
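The difference between the two modes is easiest to see with a concrete pane sequence. In the sketch below (window size, lateness bound, and the example values in the comments are illustrative assumptions), suppose one key's two-minute window sums to 3 when the watermark fires, and a single late value of 2 arrives within the allowed lateness.

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class AccumulationModes {
  // Discarding: each pane carries only what arrived since the last firing.
  // With the example above, downstream sees 3 (on time), then 2 (late);
  // consumers that need the window total must add the panes themselves.
  static Window<KV<String, Integer>> discarding() {
    return Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))
        .discardingFiredPanes();
  }

  // Accumulating: each pane carries the running total for the window.
  // Downstream sees 3, then 5; a later pane supersedes the earlier one.
  static Window<KV<String, Integer>> accumulating() {
    return Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))
        .accumulatingFiredPanes();
  }
}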


The Beam Model: Asking the Right Questions

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

The Beam Model: Batch

PCollection<String> input = pipeline.apply(HDFSSource.read());

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
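HDFSSource.read() above is slide shorthand rather than a shipped connector. As a hedged end-to-end sketch of the same batch pipeline against the current Beam 2.x Java SDK, the version below reads hypothetical "user,score,epochMillis" lines with TextIO from a placeholder path, assigns event-time timestamps, and applies the same windowed sum; the Java SDK also requires an allowed-lateness setting once a trigger is specified explicitly.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class BatchScores {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<KV<String, Integer>> input = pipeline
        // Bounded input; the path and the "user,score,epochMillis" layout are assumptions.
        .apply(TextIO.read().from("/tmp/scores/*.csv"))
        // Use the third column as the event-time timestamp so the fixed windows
        // below group by when the score happened, not by when it was read.
        .apply(WithTimestamps.of(
            (String line) -> new Instant(Long.parseLong(line.split(",")[2]))))
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((String line) ->
                KV.of(line.split(",")[0], Integer.parseInt(line.split(",")[1]))));

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO) // bounded input, so no late data expected
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());

    pipeline.run().waitUntilFinish();
  }
}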

The Beam Model: Streaming

PCollection<String> input = pipeline.apply(KafkaSource.read());

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
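KafkaSource.read() is likewise shorthand; Beam's Kafka connector is KafkaIO. A hedged sketch of the streaming variant, assuming a local broker, a "scores" topic keyed by user with integer values, and the current Beam 2.x Java SDK (the ten-minute allowed lateness is an illustrative choice):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class StreamingScores {
  public static void main(String[] args) {
    StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);

    // Unbounded source of user -> score records; broker address, topic name,
    // and deserializers are placeholder assumptions.
    PCollection<KV<String, Integer>> input = pipeline.apply(
        KafkaIO.<String, Integer>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("scores")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(IntegerDeserializer.class)
            .withoutMetadata()); // yields PCollection<KV<String, Integer>>

    // The windowed sum is unchanged from the batch version; only the source differs.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());

    pipeline.run();
  }
}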

The Beam Model: Spark Runner

Pipeline pipeline = Pipeline.create("SparkRunner");

PCollection<String> input = pipeline.apply(KafkaSource.read());

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
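Pipeline.create("SparkRunner") is again slide shorthand: in the actual SDK the runner is selected through PipelineOptions. A minimal sketch, assuming the Beam Spark runner artifact is on the classpath and a local Spark master (both assumptions):

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerSetup {
  public static void main(String[] args) {
    // Runner choice lives in the options, not in the pipeline code itself;
    // the transforms from the previous slides run unchanged.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("local[4]"); // placeholder Spark master URL

    Pipeline pipeline = Pipeline.create(options);
    // ... apply the source, windowing, and Sum transforms from the slides ...
    pipeline.run().waitUntilFinish();
  }
}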

The Beam Model: Flink Runner

Pipeline pipeline = Pipeline.create("FlinkRunner");

PCollection<String> input = pipeline.apply(KafkaSource.read());

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
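The Flink case differs only in which options interface and runner class are named; the pipeline definition itself stays the same. A minimal sketch under the same assumptions as the Spark example:

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlinkRunnerSetup {
  public static void main(String[] args) {
    // Only the options change relative to the Spark setup above.
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... same transforms as before ...
    pipeline.run();
  }
}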

Why Apache Beam?

Unified - One model handles batch and streaming use cases.

Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.

Extensible - Supports user- and community-driven SDKs, Runners, transformation libraries, and IO connectors.


The Apache Beam Vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Diagram: Beam Java, Beam Python, and other language SDKs construct pipelines against the Beam Model; the Beam Model Fn Runners layer hands them off for execution on Apache Flink, Apache Spark, and Cloud Dataflow]


Learn More!

Apache Beam (incubating) http://beam.incubator.apache.org

The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Why Apache Beam? A Google Perspective
http://goo.gl/eWTLH1

Join the mailing lists!
User discussions - user@beam.incubator.apache.org
Development discussions - dev@beam.incubator.apache.org

Follow @ApacheBeam on Twitter