Cloud Dataflow A Google Service and SDK

Google Cloud Dataflow


Page 1: Google Cloud Dataflow

Cloud Dataflow: A Google Service and SDK

Page 2: Google Cloud Dataflow

The Presenter

Alex Van Boxel, Software Architect, software engineer, devops, tester,...

at Vente-Exclusive.com

Introduction

Twitter @alexvb

Plus +AlexVanBoxel

E-mail [email protected]

Web http://alex.vanboxel.be

Page 3: Google Cloud Dataflow

Dataflow: what, how, where?

Page 4: Google Cloud Dataflow

Dataflow is...

A set of SDKs that define the programming model you use to build your streaming and batch processing pipelines (*)

Google Cloud Dataflow is a fully managed service that runs and optimizes your pipeline

Page 5: Google Cloud Dataflow

Dataflow where...

ETL
● Move
● Filter
● Enrich

Analytics
● Streaming Computation
● Batch Computation
● Machine Learning

Composition
● Connecting in Pub/Sub
  ○ Could be other dataflows
  ○ Could be something else

Page 6: Google Cloud Dataflow

NoOps Model

● Resources allocated on demand
● Lifecycle managed by the service
● Optimization
● Rebalancing

Cloud Dataflow Benefits

Page 7: Google Cloud Dataflow

Unified Programming Model

Unified: Streaming and Batch

Open Sourced
● Java 8 implementation
● Python in the works

Cloud Dataflow Benefits

Page 8: Google Cloud Dataflow

Unified Programming Model (streaming)

Cloud Dataflow Benefits

[Diagram: Dataflow (streaming), Dataflow (batch), BigQuery]

Page 9: Google Cloud Dataflow

Unified Programming Model (stream and backup)

Cloud Dataflow Benefits

[Diagram: Dataflow (streaming), Dataflow (batch), BigQuery]

Page 10: Google Cloud Dataflow

Unified Programming Model (batch)

Cloud Dataflow Benefits

[Diagram: Dataflow (streaming), Dataflow (batch), BigQuery]

Page 11: Google Cloud Dataflow

Beam: Apache Incubation

Page 12: Google Cloud Dataflow

Dataflow SDK becomes Apache Beam

Cloud Dataflow Benefits

Page 13: Google Cloud Dataflow

Dataflow SDK becomes Apache Beam

● Pipeline first, runtime second
  ○ focus on your data pipelines, not the characteristics of the runner
● Portability
  ○ portable across a number of runtime engines
  ○ choose a runtime based on any number of considerations
    ■ performance
    ■ cost
    ■ scalability

Cloud Dataflow Benefits

Page 14: Google Cloud Dataflow

Dataflow SDK becomes Apache Beam

● Unified model
  ○ Batch and streaming are integrated into a unified model
  ○ Powerful semantics, such as windowing, ordering and triggering
● Development tooling
  ○ tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools

Cloud Dataflow Benefits

Page 15: Google Cloud Dataflow

Dataflow SDK becomes Apache Beam

● Scala Implementation
● Alternative Runners
  ○ Spark Runner from Cloudera Labs
  ○ Flink Runner from data Artisans

Cloud Dataflow Benefits

Page 16: Google Cloud Dataflow

Dataflow SDK becomes Apache Beam

● Cloudera

● data Artisans

● Talend

● Cask

● PayPal

Cloud Dataflow Benefits

Page 17: Google Cloud Dataflow

SDK Primitives: Autopsy of a Dataflow pipeline

Page 18: Google Cloud Dataflow

Pipeline

● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, …

● Transforms: Filter, Join, Aggregate, …

● Windows: Sliding, Sessions, Fixed, …

● Triggers: Correctness….

● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ...

Logical Model

Page 19: Google Cloud Dataflow

PCollection<T>

● Immutable collection
● Can be bounded or unbounded
● Created by
  ○ a backing data store
  ○ a transformation from another PCollection
  ○ generated
● PCollection<KV<K,V>>

Dataflow SDK Primitives

Page 20: Google Cloud Dataflow

ParDo<I,O>

● Parallel processing of each element in the PCollection
● The developer provides a DoFn<I,O>
● Shards are processed independently of each other

Dataflow SDK Primitives
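The ParDo idea can be illustrated without the SDK. The following is a minimal plain-Java sketch, not Dataflow code: `ElementFn` is a hypothetical stand-in for `DoFn<I,O>`, and `parDo` is an assumed helper that applies it to every element of an in-memory list. It only shows the contract (one element in, zero or more elements out, no shared state between elements); the real service distributes this work across workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;

// Plain-Java illustration of the ParDo idea; none of these types belong to the Dataflow SDK.
class ParDoSketch {

    /** Hypothetical stand-in for DoFn<I,O>: handles one element and may emit zero or more outputs. */
    interface ElementFn<I, O> {
        void process(I element, Consumer<O> out);
    }

    /** Applies fn to each element independently and returns a new list; the input list is not modified. */
    static <I, O> List<O> parDo(List<I> input, ElementFn<I, O> fn) {
        return input.parallelStream()
                .flatMap(element -> {
                    // Each element gets its own output buffer, so no state is shared between elements.
                    List<O> emitted = new ArrayList<>();
                    fn.process(element, emitted::add);
                    return emitted.stream();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not", "");
        List<String> words = ParDoSketch.<String, String>parDo(lines, (line, out) -> {
            for (String w : line.split(" ")) {
                if (!w.isEmpty()) {
                    out.accept(w);
                }
            }
        });
        System.out.println(words); // prints [to, be, or, not]
    }
}
```

Because the function sees one element at a time and keeps no shared state, the runner is free to split the input into shards and process them in any order.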

Page 21: Google Cloud Dataflow

ParDo<I,O>

Interesting concept of side outputs: allows for data sharing

Dataflow SDK Primitives

Page 22: Google Cloud Dataflow

PTransform

● Combine small operations into bigger transforms
● Standard transforms (Count, Sum, ...)
● Reuse and monitoring

Dataflow SDK Primitives

public class BlockTransform
    extends PTransform<PCollection<Legacy>, PCollection<Event>> {

  @Override
  public PCollection<Event> apply(PCollection<Legacy> input) {
    return input
        // Element-wise steps, each wrapped in a ParDo.
        .apply(ParDo.of(new TimestampEvent1Fn()))
        .apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class))
        // Session windows: a session closes after 20 minutes without events.
        .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
        .apply(ParDo.of(new ExtactChannelUserFn()));
  }
}

Page 23: Google Cloud Dataflow

Merging

● GroupByKey
  ○ Takes a PCollection of KV<K,V> and groups the values by key
  ○ Input must be keyed
● Flatten
  ○ Merges PCollections of the same type into one

Dataflow SDK Primitives
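As a rough illustration of what the two merging primitives compute, here is a plain-Java sketch over in-memory lists. It does not use the Dataflow SDK: `groupByKey`, `flatten` and `kv` are hypothetical helpers, and `Map.Entry` stands in for `KV<K,V>`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of GroupByKey and Flatten semantics; not the Dataflow SDK.
class MergingSketch {

    /** Hypothetical helper that builds a key/value pair (stand-in for KV.of). */
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    /** Collects all values that share a key, keeping keys in first-seen order. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Merges several collections of the same element type into one. */
    @SafeVarargs
    static <T> List<T> flatten(List<T>... inputs) {
        List<T> merged = new ArrayList<>();
        for (List<T> input : inputs) {
            merged.addAll(input);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> web = Arrays.asList(kv("alice", 1), kv("bob", 2));
        List<Map.Entry<String, Integer>> app = Arrays.asList(kv("alice", 3));
        List<Map.Entry<String, Integer>> all = flatten(web, app);
        System.out.println(groupByKey(all)); // prints {alice=[1, 3], bob=[2]}
    }
}
```

Flatten only works when both inputs carry the same element type, which is why it is typically followed by a GroupByKey when the sources share a key space.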

Page 24: Google Cloud Dataflow

Input/Output

Dataflow SDK Primitives

// Example 1: read an unbounded stream from Pub/Sub.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

// streamData is Unbounded; apply windowing afterward.
PCollection<String> streamData =
    p.apply(PubsubIO.Read.named("ReadFromPubsub")
        .topic("/topics/my-topic"));

// Example 2 (a separate program): read a bounded table from BigQuery.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> weatherData = p
    .apply(BigQueryIO.Read.named("ReadWeatherStations")
        .from("clouddataflow-readonly:samples.weather_stations"));

Page 25: Google Cloud Dataflow

● Multiple Inputs + Outputs

Input/Output

Dataflow SDK Primitives

Page 26: Google Cloud Dataflow

Primitives on steroids: Autopsy of a Dataflow pipeline

Page 27: Google Cloud Dataflow

Windows

● Split unbounded collections (e.g. Pub/Sub)
● Work on bounded collections as well

Dataflow SDK Primitives

.apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))

Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
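To make the session-window idea concrete, here is a small plain-Java sketch, independent of the SDK. The `sessions` helper is hypothetical and works on a list of event timestamps for a single key: an event joins the current session when it arrives less than one gap duration after the previous event, otherwise it opens a new session. This mirrors the usual description of session windows, not the service's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration of session windowing for a single key; not the Dataflow SDK.
class SessionWindowSketch {

    /**
     * Groups event timestamps (epoch millis) into sessions. A new session starts
     * when an event is at least gapMillis after the previous event.
     */
    static List<List<Long>> sessions(List<Long> timestamps, long gapMillis) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);

        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long ts : sorted) {
            if (current == null || ts - previous >= gapMillis) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(ts);
            previous = ts;
        }
        return result;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Events at minutes 0, 5, 30, 45 and 100, with a 20-minute gap as on the slide.
        List<Long> events = Arrays.asList(0L, 5 * minute, 30 * minute, 45 * minute, 100 * minute);
        List<List<Long>> grouped = sessions(events, 20 * minute);
        System.out.println(grouped.size()); // prints 3
    }
}
```

In the example, the events at minutes 0 and 5 form one session, 30 and 45 form a second, and 100 stands alone.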

Page 28: Google Cloud Dataflow

Triggers

● Time-based
● Data-based
● Composite

Dataflow SDK Primitives

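AfterEach.inOrder(
    // First firing: when the watermark passes the end of the window.
    AfterWatermark.pastEndOfWindow(),
    // Later firings: 10 minutes (processing time) after the first element of each pane...
    Repeatedly.forever(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(10)))
        // ...until the watermark is 2 days past the end of the window.
        .orFinally(AfterWatermark
            .pastEndOfWindow()
            .plusDelayOf(Duration.standardDays(2))));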
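The composite trigger above depends on where the watermark sits relative to the end of a window. The sketch below is a deliberately simplified plain-Java model of that relationship, not SDK code: the `classify` helper and the `Arrival` enum are hypothetical, and the "allowed lateness" parameter plays the role of the two-day cut-off in the trigger.

```java
// Simplified plain-Java model of on-time versus late data; not the Dataflow SDK.
class LatenessSketch {

    /** Hypothetical labels for data arriving for a given window. */
    enum Arrival { ON_TIME, LATE, TOO_LATE }

    /**
     * Classifies data for a window by comparing the current watermark with the window end.
     * All values are epoch millis (or any consistent time unit).
     */
    static Arrival classify(long windowEnd, long watermark, long allowedLateness) {
        if (watermark < windowEnd) {
            return Arrival.ON_TIME;   // the watermark has not yet passed the end of the window
        }
        if (watermark < windowEnd + allowedLateness) {
            return Arrival.LATE;      // past the end of the window, but still within the cut-off
        }
        return Arrival.TOO_LATE;      // past the cut-off, so no further firings for this window
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long windowEnd = 10 * hour;
        long twoDays = 48 * hour;
        System.out.println(classify(windowEnd, 9 * hour, twoDays));  // prints ON_TIME
        System.out.println(classify(windowEnd, 11 * hour, twoDays)); // prints LATE
        System.out.println(classify(windowEnd, 60 * hour, twoDays)); // prints TOO_LATE
    }
}
```

The model ignores processing-time delays and pane accumulation; it only shows why the trigger needs both a watermark condition and a final cut-off.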

Page 29: Google Cloud Dataflow

Input/Output

Dataflow SDK Primitives

[Figure: event time versus processing time, showing the watermark with one early and two late trigger firings]

Page 30: Google Cloud Dataflow

Input/Output

Dataflow SDK Primitives

[Figure: event time versus processing time, showing the watermark with one early and two late trigger firings]

Page 31: Google Cloud Dataflow

Google Cloud Dataflow: The service

Page 32: Google Cloud Dataflow

What you need

● Google Cloud SDK
  ○ utilities to authenticate and manage your project
● Java IDE
  ○ Eclipse
  ○ IntelliJ
● Standard Build System
  ○ Maven
  ○ Gradle

Dataflow SDK Primitives

Page 33: Google Cloud Dataflow

What you need

Dataflow SDK Primitives

subprojects {
  apply plugin: 'java'

  repositories {
    maven { url "http://jcenter.bintray.com" }
    mavenLocal()
  }

  dependencies {
    compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0'
    testCompile 'org.hamcrest:hamcrest-all:1.3'
    testCompile 'org.assertj:assertj-core:3.0.0'
    testCompile 'junit:junit:4.12'
  }
}

Page 34: Google Cloud Dataflow

How it works

Google Cloud Dataflow Service

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
    // Split each line into words.
    .apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+")))
        .withOutputType(new TypeDescriptor<String>() {}))
    // Drop empty strings produced by the split.
    .apply(Filter.byPredicate((String word) -> !word.isEmpty()))
    // Count the occurrences of each distinct word.
    .apply(Count.<String>perElement())
    // Format each count as "word: count".
    .apply(MapElements
        .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue())
        .withOutputType(new TypeDescriptor<String>() {}))
    .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));

// Nothing executes until the pipeline is run.
p.run();
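For comparison, the same word-count logic can be sketched with plain Java streams on an in-memory list. This is only an illustration of what each pipeline step computes, under the assumption that the data fits in memory; `countWords` and `format` are hypothetical helpers, and nothing here touches the Dataflow service.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// In-memory equivalent of the word-count steps, using java.util.stream instead of the Dataflow SDK.
class WordCountSketch {

    /** Splits lines into words, drops empty strings and counts each distinct word. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))  // like FlatMapElements
                .filter(word -> !word.isEmpty())                            // like Filter.byPredicate
                .collect(Collectors.groupingBy(                             // like Count.perElement
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    /** Formats each entry as "word: count", like the MapElements step. */
    static List<String> format(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + ": " + entry.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(format(countWords(lines))); // prints [be: 2, not: 1, or: 1, to: 2]
    }
}
```

The stream version runs on one machine and sorts its output through the `TreeMap`; the pipeline version describes the same steps but leaves execution, sharding and output ordering to the runner.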

Page 35: Google Cloud Dataflow

Service

● Stage
● Optimize (Flume)
● Execute

Google Cloud Dataflow Service

Page 36: Google Cloud Dataflow

Optimization

Google Cloud Dataflow Service

Page 37: Google Cloud Dataflow

Optimization

Google Cloud Dataflow Service

Page 38: Google Cloud Dataflow

Console

Google Cloud Dataflow Service

Page 39: Google Cloud Dataflow

Lifecycle

Google Cloud Dataflow Service

Page 40: Google Cloud Dataflow

Compute Worker Scaling

Google Cloud Dataflow Service

Page 41: Google Cloud Dataflow

Compute Worker Rescaling

Google Cloud Dataflow Service

Page 42: Google Cloud Dataflow

Compute Worker Rebalancing

Google Cloud Dataflow Service

Page 43: Google Cloud Dataflow

Compute Worker Rescaling

Google Cloud Dataflow Service

Page 44: Google Cloud Dataflow

Lifecycle

Google Cloud Dataflow Service

Page 45: Google Cloud Dataflow

BigQuery

Google Cloud Dataflow Service

Page 46: Google Cloud Dataflow

Testability: Test-Driven Development Built In

Page 47: Google Cloud Dataflow

Testability - Built in

Dataflow Testability

KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>());
input.getValue().add(CalEvent.event("SESSION", "REFERRAL")
    .build());
input.getValue().add(CalEvent.event("SESSION", "START")
    .data("{\"referral\":{\"Sourc...}").build());

DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester =
    DoFnTester.of(new SessionRepairFn());
fnTester.processElement(input);

List<CalEvent> calEvents = fnTester.peekOutputElements();
assertThat(calEvents).hasSize(2);

CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL");
assertNotNull("REFERRAL should still exist", referral);

CalEvent start = CalUtil.findEvent(calEvents, "START");
assertNotNull("START should still exist", start);
assertNull("START should have referral removed.",

Page 48: Google Cloud Dataflow

Appendix: Follow the yellow brick road

Page 49: Google Cloud Dataflow

Paper - FlumeJava

FlumeJava: Easy, Efficient Data-Parallel Pipelines

http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

More Dataflow Resources

Page 50: Google Cloud Dataflow

Dataflow/Beam vs Spark

Dataflow/Beam & Spark: A Programming Model Comparison

https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

More Dataflow Resources

Page 51: Google Cloud Dataflow

Github - Integrate Dataflow with Luigi

Google Cloud integration for Luigi

https://github.com/alexvanboxel/luigiext-gcloud

More Dataflow Resources

Page 52: Google Cloud Dataflow

Questions and Answers: The last slide

Twitter @alexvb

Plus +AlexVanBoxel

E-mail [email protected]

Web http://alex.vanboxel.be