DataFlow & Beam Gabe Hamilton

Page 1: DataFlow & Beam

DataFlow & Beam

Gabe Hamilton

Page 2: DataFlow & Beam

So you’ve built your perfect video game.

People all over the world are playing it.

Page 3: DataFlow & Beam

Now for Billing, High Scores, etc.

People are playing your game on servers all over the world.

It’s time to start crunching all your data for billing, high scores, error reports, etc.

The time that events happened is important.

You charge per minute played, with surge pricing!

Data often arrives late.

Network delays happen; servers go down and send their data hours later.

Page 4: DataFlow & Beam

Google DataFlow?

Apache Beam?

Yes!

Page 5: DataFlow & Beam

What we’re going to cover

What is Dataflow?

Start a demo!

DataFlow Code

Batches & Streaming

Event Time

Page 6: DataFlow & Beam

What is Google DataFlow?

Distributed Streaming (and Batch) Data processing engine

Pulls in data from Sources

Writes data to Sinks

Spins up data processing nodes, pushes your code out to them

Like Hadoop, but handles unbounded data? Yep, similar idea.

Page 7: DataFlow & Beam

Up and running in 10 mins

1. https://cloud.google.com/dataflow/getting-started

a. create a project

b. add dataflow API

c. create a google storage bucket

d. gcloud auth login

2. git clone git@github.com:gabehamilton/DataflowGroovySDK-examples.git

3. gradlew run -Pargs="project=PROJECT_NAME stagingLocation=BUCKET_NAME" (requires a JDK)

Page 8: DataFlow & Beam

Let's see some code - config

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions)

options.setProject('myproject')
options.setStagingLocation("gs://aStagingBucket")
options.setNumWorkers(1000) // ← !!! default is 3
options.setStreaming(true);

Pipeline pipeline = Pipeline.create(options);

Page 9: DataFlow & Beam

Let's see some code - pipeline

pipeline // Extract and sum username/score pairs from the event data.
    .apply(TextIO.Read.from(options.getInput())) // Read events from a text file
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    .apply("SumByUser", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums", new WriteToBigQuery(options.getTableName()));
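To make the parse-then-sum steps concrete, here is a rough plain-JDK sketch of what such a pipeline computes, with no Dataflow SDK involved. The input line format (user,team,score,timestamp), the class name UserScoreSketch, and the choice to skip malformed lines are all assumptions for illustration; the slides do not show the real ParseEventFn or ExtractAndSumScore.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the "parse" and "sum by user" steps; not SDK code.
class UserScoreSketch {

    // Sums scores per user. Each line is assumed to look like
    // "user,team,score,timestampMillis"; anything else is skipped.
    public static Map<String, Integer> sumScoresByUser(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 4) {
                continue; // malformed line: too few fields
            }
            try {
                int score = Integer.parseInt(parts[2].trim());
                totals.merge(parts[0].trim(), score, Integer::sum);
            } catch (NumberFormatException e) {
                // non-numeric score: skip the line
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "alice,red,10,1000",
            "bob,blue,7,2000",
            "alice,red,5,3000",
            "garbage line");
        System.out.println(sumScoresByUser(lines)); // prints {alice=15, bob=7}
    }
}
```

In the real pipeline the same two steps run in parallel across workers, and the result goes to BigQuery rather than to standard output.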

Page 10: DataFlow & Beam

Components

PCollection - Parallel Collection

Standard interface to a set of data.

Can be a streaming data set.

PTransform - Parallel Transform

takes Input, produces Output

ParDo - Parallel Do

Your custom Transform Function

Page 11: DataFlow & Beam

Demo

Page 12: DataFlow & Beam

Running Dataflow - staging files

Our code

Output

Dependencies

Code is staged in the staging bucket before getting pushed to Workers

Page 13: DataFlow & Beam

Staged files - detail

Page 14: DataFlow & Beam

We don’t need no stinking Batches

Handles Batches!

Streaming!

Not real-time streaming; unbounded-data-set streaming.

Continuously processing your:

User Scores, Billing, Analytics

Risk, Spam, and other deviations from the mean

Page 15: DataFlow & Beam

Event Time

Dataflow lets you work in event time (when the event says it happened) rather than processing time (when the event was received).

This allows out-of-order processing.

Example: a plane full of mobile users has just landed; they turn their phones back on and start delivering the past 2 hours of data.
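The difference between the two clocks can be sketched in plain Java (no SDK). The class name EventTimeSketch and the {eventTime, receivedTime} array layout are assumptions made for this example. It counts events per one-hour window twice: once keyed by when each event happened, once by when it arrived.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of event time vs. processing time; not SDK code.
class EventTimeSketch {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    // Start of the fixed one-hour window containing a non-negative timestamp.
    public static long windowStart(long timestampMillis) {
        return timestampMillis - (timestampMillis % HOUR_MILLIS);
    }

    // Each event is {eventTimeMillis, receivedTimeMillis}.
    // Returns a count of events per window, keyed by window start.
    public static Map<Long, Integer> countPerHour(long[][] events, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long[] e : events) {
            long ts = useEventTime ? e[0] : e[1];
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long h = HOUR_MILLIS;
        long[][] events = {
            {1 * h + 5, 1 * h + 10},  // happened in hour 1, arrived promptly
            {1 * h + 20, 3 * h + 1},  // happened in hour 1, arrived two hours late
            {3 * h + 2, 3 * h + 3}    // happened in hour 3, arrived promptly
        };
        System.out.println("event time:      " + countPerHour(events, true));
        System.out.println("processing time: " + countPerHour(events, false));
    }
}
```

Keyed by event time, hour 1 holds two events; keyed by processing time, the late event is wrongly credited to hour 3. For per-minute billing with surge pricing, only the first answer is the one you want.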

Page 16: DataFlow & Beam

Features for working with Event Time

Windowing: hourly, session-based

Watermarks: all the data is in.

Fixed watermark: end of match, end of file

Heuristic watermark: the data is probably in

Percentile watermark: 90% of the data is in

Triggers: emitting partial results.

Accumulations: ways of dealing with late data
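One way to picture a heuristic watermark is the toy tracker below. It is only an illustration of the idea "the data is probably in", not how Dataflow computes its watermarks. The class name WatermarkSketch and the rule "watermark = newest event time seen minus an assumed maximum delay" are assumptions chosen for the example.

```java
// Hypothetical toy watermark; real runners derive watermarks from their sources.
class WatermarkSketch {
    private final long assumedMaxDelayMillis;
    private long maxEventTimeSeen = Long.MIN_VALUE; // MIN_VALUE means "nothing seen yet"

    public WatermarkSketch(long assumedMaxDelayMillis) {
        this.assumedMaxDelayMillis = assumedMaxDelayMillis;
    }

    // Current guess: data older than this is probably all in.
    public long watermark() {
        if (maxEventTimeSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE;
        }
        return maxEventTimeSeen - assumedMaxDelayMillis;
    }

    // Records an element and reports whether it arrived behind the watermark (late).
    public boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < watermark();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(100);
        System.out.println(w.observe(1000)); // false: first element, watermark becomes 900
        System.out.println(w.observe(950));  // false: out of order but ahead of the watermark
        System.out.println(w.observe(800));  // true: behind the watermark, so late
    }
}
```

Because the watermark is a guess, some elements will still arrive behind it; triggers and allowed lateness decide what happens to those.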

Page 17: DataFlow & Beam

Streaming Event Time Example

Window: hourly (events per hour). How many errors occurred between 5 and 6 pm?

Trigger: each minute. As we process data, update windows every minute.

Allowed lateness: 12 hours. After 12 hours, discard any late data that arrives.

Accumulation: discarding. After a window fires, throw away what was just emitted; the next firing reports only the data that arrived since.
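The accumulation and lateness settings can be sketched in plain Java (no SDK). The class name PaneSketch, its method names, and the list-of-batches input shape are assumptions for illustration. Each inner list holds the values that arrived for one window between two consecutive trigger firings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of accumulation modes and allowed lateness; not SDK code.
class PaneSketch {

    // Returns the sum emitted at each firing of a single window.
    // Discarding: each firing reports only what arrived since the last firing.
    // Accumulating: each firing reports everything seen so far in the window.
    public static List<Integer> firedPaneSums(List<List<Integer>> batches, boolean accumulating) {
        List<Integer> fired = new ArrayList<>();
        int runningTotal = 0;
        for (List<Integer> batch : batches) {
            int batchSum = 0;
            for (int value : batch) {
                batchSum += value;
            }
            runningTotal += batchSum;
            fired.add(accumulating ? runningTotal : batchSum);
        }
        return fired;
    }

    // An element is dropped once the watermark has passed window end + allowed lateness.
    public static boolean isDropped(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        return watermarkMillis > windowEndMillis + allowedLatenessMillis;
    }

    public static void main(String[] args) {
        // One "1" per error event; three firings for the same window.
        List<List<Integer>> batches = Arrays.asList(
            Arrays.asList(1, 1), Arrays.asList(1), Arrays.asList(1, 1, 1));
        System.out.println(firedPaneSums(batches, false)); // prints [2, 1, 3]
        System.out.println(firedPaneSums(batches, true));  // prints [2, 3, 6]

        long hour = 60L * 60L * 1000L;
        System.out.println(isDropped(18 * hour, 12 * hour, 29 * hour)); // false: still within 12 hours
        System.out.println(isDropped(18 * hour, 12 * hour, 31 * hour)); // true: too late, dropped
    }
}
```

With discarding panes, the consumer has to add the emitted values together to get the window total; with accumulating panes, each new value supersedes the previous one.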

Page 18: DataFlow & Beam

Code - streaming event time

.apply(Window.into(FixedWindows.of(ONE_HOUR)) // ONE_HOUR = Duration.standardHours(1)
    .triggering(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(ONE_MINUTE))
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_MINUTES)))
    .withAllowedLateness(TWELVE_HOURS)
    .discardingFiredPanes())

Page 19: DataFlow & Beam

Demo 2

Streaming Fraud Detection

Page 20: DataFlow & Beam

What is Apache Beam?

A standard for running pipelines on different engines

Direct PipelineRunner (i.e. local)

Dataflow PipelineRunner

Flink PipelineRunner

Spark PipelineRunner (new)

Page 21: DataFlow & Beam

Apache Beam

Page 22: DataFlow & Beam

What to remember

Process lots of data

Out-of-order & late data

On the cluster of your choice

Locally testable

Page 23: DataFlow & Beam

Questions?

Answers Answers Answers Answers Answers Answers

Page 24: DataFlow & Beam

Thanks!

Image credits:

http://fav.me/d80wco9 Game mashup

http://mrg.bz/UwguyD Red Beams

http://mrg.bz/ccBto0 Blue Beams

http://mrg.bz/QfHhyS Steel beam frame

http://mrg.bz/Dtcc1B Clock