Apache Beam and Google Cloud Dataflow


Google Cloud Dataflow: the next generation of managed big data services, based on the Apache Beam programming model

Szabolcs Feczak, Cloud Solutions Engineer

Google

9th Cloud & Data Center World 2016 - IDG Korea

Goals

1. You leave here understanding the fundamentals of the Apache Beam model and the Google Cloud Dataflow managed service.
2. We have some fun.

Background and Historical overview

The trade-off quadrant of Big Data

[Figure: quadrant balancing Completeness, Speed, Cost Optimization, and Complexity against Time to Answer]

[Figure: historical timeline of data processing systems - MapReduce, Hadoop, Flume, Storm, Spark, MillWheel, Flink, Apache Beam - annotated with the features associated with each: Batch, Streaming, Pipelines, Unified API, No Lambda architecture, Iterative, Interactive, Exactly Once, State, Timers, Auto-Awesome, Watermarks, Windowing, High-level API, Managed Service, Triggers, Open Source, Unified Engine, Optimizer]

Deep dive, probing familiarity with the subject

1M Devices

16.6K Events/sec

43B Events/month

518B Events/year

Before Apache Beam:
Batch (Accuracy, Simplicity, Savings) OR Stream (Speed, Sophistication, Scalability)

After Apache Beam:
Batch (Accuracy, Simplicity, Savings) AND Stream (Speed, Sophistication, Scalability)

Balancing correctness, latency, and cost with a unified batch and streaming model

Apache Beam (incubating)

Software Development Kits:
• Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
• Python (ALPHA)
• Scala: /darkjh/scalaflow and /jhlch/scala-dataflow-dsl

Runners:
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow

http://incubator.apache.org/projects/beam.html - The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam.
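To make the SDK/runner split concrete, here is a minimal sketch of selecting a runner through pipeline options. The class names follow the Dataflow Java SDK 1.x linked above; they were moved and renamed under Apache Beam, so treat them as era-specific assumptions.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

    public class RunnerSelection {
      public static void main(String[] args) {
        // The same pipeline graph can target the Dataflow service,
        // the Spark runner, or the Flink runner; only the options change.
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation().as(DataflowPipelineOptions.class);
        options.setRunner(DataflowPipelineRunner.class); // or pass --runner=...
        Pipeline p = Pipeline.create(options);
        // ... apply transforms ...
        p.run();
      }
    }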

Where might you use Apache Beam?

• Movement
• Filtering
• Enrichment
• Shaping
• Reduction
• Batch computation
• Continuous computation
• Composition
• External orchestration
• Simulation

ETL - Analysis - Orchestration

Why would you go with a managed service?

Cloud Dataflow Managed Service advantages (GA since August 2015):

[Figure: user code & SDK is submitted to the managed service on GCP; the Job Manager deploys & schedules, the Work Manager distributes work and handles worker lifecycle management, and the Monitoring UI surfaces progress & logs - deploy, schedule & monitor, tear down]
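As a sketch of what "deploy & schedule" asks of the user: submitting to the managed service is mostly a matter of options. The project ID and bucket below are placeholders, and the option names are from the Dataflow Java SDK 1.x.

    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
    options.setProject("my-gcp-project");                 // placeholder project ID
    options.setStagingLocation("gs://my-bucket/staging"); // jars staged here for workers
    options.setRunner(BlockingDataflowPipelineRunner.class); // block until the job finishes
    Pipeline.create(options).run(); // service deploys, schedules, monitors, tears down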

Challenge: cost optimization

❯ Time & life never stop
❯ Data rates & schemas are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and can create lag

Auto-scaling:
[Figure: incoming load of 800 QPS at 10:00, 1200 QPS at 11:00, 5000 QPS at 12:00, and 50 QPS at 13:00, with the worker pool resized to match]
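Opting in to autoscaling is a one-line hint to the service. A sketch with Dataflow SDK 1.x option names (the cap of 20 workers is an arbitrary example, and imports from com.google.cloud.dataflow.sdk.options are assumed):

    DataflowPipelineWorkerPoolOptions pool =
        options.as(DataflowPipelineWorkerPoolOptions.class);
    // Let the service grow and shrink the worker pool with throughput and backlog.
    pool.setAutoscalingAlgorithm(
        DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
    pool.setMaxNumWorkers(20); // upper bound on elasticity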

Dynamic Work Rebalancing - Cloud Dataflow Service

[Figure: the same job takes 100 mins. without dynamic work rebalancing vs. 65 mins. with it]

Graph Optimization - Cloud Dataflow Service

● ParDo fusion
  ○ Producer-Consumer
  ○ Sibling
  ○ Intelligent fusion boundaries
● Combiner lifting, e.g. partial aggregations before reduction
● http://research.google.com/search.html?q=flume%20java

[Figure: producer-consumer fusion merges a ParallelDo C feeding a ParallelDo D into a single C+D; sibling fusion merges two ParallelDos C and D reading the same input into C+D; combiner lifting rewrites A → GBK → + → B into A → + → GBK → + → B, applying a partial CombineValues before the GroupByKey. Legend: GBK = GroupByKey, + = CombineValues]
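Combiner lifting only applies when the aggregation is expressed through a Combine-style transform the optimizer can split into partial and final stages. A sketch with SDK 1.x names (pairs is an assumed PCollection of key-value pairs):

    // Liftable: each worker applies a partial CombineValues before the
    // GroupByKey shuffle, and the partial sums are combined again after it.
    PCollection<KV<String, Integer>> totals =
        pairs.apply(Sum.<String>integersPerKey());
    // By contrast, a raw GroupByKey followed by a hand-written summing ParDo
    // is opaque to the optimizer and forces every value through the shuffle.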

Deep dive into the programming model

The Apache Beam Logical Model

● What are you computing?
● Where in event time?
● When in processing time?
● How do refinements relate?

What are you computing?

● A Pipeline represents a graph
● Nodes are data processing transformations
● Edges are data sets flowing through the pipeline
● Optimized and executed as a unit for efficiency

What are you computing? PCollections

● A PCollection is a collection of homogeneous data of the same type
● May be bounded or unbounded in size
● Each element has an implicit timestamp
● Initially created from backing data stores
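A minimal sketch tying Pipelines and PCollections together (Dataflow Java SDK 1.x names; the Cloud Storage path is a placeholder, and imports from com.google.cloud.dataflow.sdk are assumed):

    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    // A bounded PCollection<String>, created from a backing data store;
    // each element (one line of text) carries an implicit timestamp.
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"));
    p.run();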

Challenge: completeness when processing continuous data

[Figure: events that occurred at 8:00 keep arriving throughout processing time, from 8:00 until 14:00]

What are you computing? PTransforms transform PCollections into other PCollections.

● Element-wise (Map + Reduce = ParDo)
● Aggregating (Combine, Join, Group)
● Composite

[Figure: the composite Count transform expands into Pair With Ones → GroupByKey → Sum Values]

Composite PTransforms - Apache Beam SDK

❯ Define new PTransforms by building up subgraphs of existing transforms
❯ Some utilities are included in the SDK: Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own: DoSomething, DoSomethingElse, etc.
❯ Why bother? Code reuse and a better monitoring experience
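A sketch of a hand-rolled composite, in the spirit of the SDK's own Count (SDK 1.x, where composites override apply; Apache Beam later renamed this method expand):

    // A composite PTransform is just a named subgraph of existing transforms;
    // the name also groups its steps in the monitoring UI.
    static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
      @Override
      public PCollection<KV<String, Long>> apply(PCollection<String> words) {
        return words.apply(Count.<String>perElement()); // itself a composite
      }
    }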

Example: Computing Integer Sums
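The running example sums integers per key. As a sketch (SDK 1.x names; the shape of the input is an assumption):

    // scores: a PCollection of (player, points) pairs.
    PCollection<KV<String, Integer>> sums =
        scores.apply(Sum.<String>integersPerKey());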

[Figure: three windowing strategies applied to three keys - Fixed windows (1, 2, 3, 4), overlapping Sliding windows (1-5), and per-key Sessions (bursts 1-4 separated by gaps)]

Where in Event Time?

● Windowing divides data into event-time-based finite chunks.

● Required when doing aggregations over unbounded data.


Example: Fixed 2-minute Windows
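A sketch of the fixed-window variant (SDK 1.x names; Duration is joda-time, and scores is the input from the earlier sketch):

    PCollection<KV<String, Integer>> windowedSums = scores
        // Assign each element to a 2-minute event-time window...
        .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2))))
        // ...then sum per key within each window.
        .apply(Sum.<String>integersPerKey());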

When in Processing Time?

● Triggers control when results are emitted.
● Triggers are often relative to the watermark.

[Figure: processing time vs. event time; the watermark tracks event-time progress, and its lag behind processing time is the skew]


Example: Triggering at the Watermark
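A sketch of the watermark trigger (in SDK 1.x, setting a trigger also requires an allowed lateness and an accumulation mode, so both are shown):

    PCollection<KV<String, Integer>> sums = scores
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            // Emit each window once, when the watermark passes its end.
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO) // drop anything later
            .discardingFiredPanes())
        .apply(Sum.<String>integersPerKey());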


Example: Triggering for Speculative & Late Data
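Extending the previous sketch with speculative (early) and late firings; the one-minute delay and one-day lateness bound are illustrative choices, not prescriptions:

    PCollection<KV<String, Integer>> sums = scores
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // Speculative result one minute after the first element arrives.
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                // A refinement for every late element.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            .accumulatingFiredPanes()) // see the next section
        .apply(Sum.<String>integersPerKey());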


How do Refinements Relate?

● How should multiple outputs per window accumulate?

● Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      3          3            3              3
Watermark        5, 1       6            9              9, -3
Late             2          2            11             11, -9
Total Observed   11         11           23             11
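In the API, the first two modes are a single call on the windowing strategy (SDK 1.x names; retractions were not yet exposed at the time). Each call returns a new transform with that mode:

    Window.Bound<KV<String, Integer>> strategy =
        Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.standardDays(1));
    strategy.discardingFiredPanes();   // panes carry only new elements
    strategy.accumulatingFiredPanes(); // panes re-emit the running total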

Example: Add Newest, Remove Previous

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions

Customizing What / Where / When / How

The key takeaway: Optimizing Your Time To Answer

Typical data processing: programming, plus resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, and utilization improvements.

Data processing with Cloud Dataflow: programming - and more time to dig into your data.

What do customers have to say about Google Cloud Dataflow?

"We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost."

Sudhir Hasbe, Director of Software Engineering, Zulily.com

“The current iteration of Qubit’s real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google’s MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow - which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful and expressive API.”

Jibran Saithi, Lead Architect, Qubit

"We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark"

Paul Clarke, Director of Technology, Ocado

“Boosting performance isn’t the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. That in turn means we will have much more time to make Spotify’s products better.”

Igor Maravić, Software Engineer working at Spotify

Demo Time! Let’s build something:

1. Ingest the stream of Wikipedia edits (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org), create a pipeline and run a Dataflow job to extract the top 10 active editors and top 10 pages edited, then inspect the result set in our data warehouse (BigQuery).

2. Extract words from a Shakespeare corpus, count the occurrences of each word, and write sharded results as blobs into a key-value store (Cloud Storage).
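A minimal sketch of the second demo's shape (SDK 1.x WordCount style; gs://dataflow-samples/shakespeare/* is the public sample corpus, the output bucket and word regex are placeholders):

    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());
    p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
     .apply(ParDo.of(new DoFn<String, String>() {  // split lines into words
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) { c.output(word); }
         }
       }
     }))
     .apply(Count.<String>perElement())            // (word, occurrences)
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() { // format for text
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/wordcount/result")); // sharded blobs
    p.run();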

Thank You!

cloud.google.com/dataflow
cloud.google.com/blog/big-data/
cloud.google.com/solutions/articles#bigdata
cloud.google.com/newsletter
research.google.com
