GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2017.06.12. Kassai Csaba - Lead Data Architect

BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW

2017.06.12.

Kassai Csaba - Lead Data Architect

Farkas Péter - Data Engineer

BIG DATA IN THE GOOGLE CLOUD

Agenda

● Google Cloud Storage
● Google BigQuery
● Apache Beam
● Google Cloud Pub/Sub
● Google Cloud Dataflow
● Case studies

GCP

Cloud Storage

Google Cloud Storage

Consistency

When a write succeeds, the latest copy of the object is guaranteed to be returned to any GET, globally. This applies to PUTs of new or overwritten objects and to DELETEs.

Google Cloud Storage

Object Lifecycle Management

Actions
● delete live/archived objects
● “downgrade” storage class

Conditions
● age
● create time
● live/archive
● # newer versions
● storage class
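The pairing of conditions and actions can be sketched as a small rule check. This is a hypothetical in-memory model written for illustration, not the Cloud Storage API: the class name, its fields and the 30-day example rule are all assumptions, and only three of the listed conditions are modelled. It relies on the documented behaviour that a rule applies only when all of its conditions are met.

```java
/** Hypothetical in-memory model of one lifecycle rule; this is not the Cloud Storage API. */
final class LifecycleRuleSketch {
    enum Action { DELETE, SET_STORAGE_CLASS }

    final Action action;
    final Integer minAgeDays;   // "age" condition; null means the condition is not set
    final Boolean live;         // "live/archive" condition; null means not set
    final String storageClass;  // "storage class" condition; null means not set

    LifecycleRuleSketch(Action action, Integer minAgeDays, Boolean live, String storageClass) {
        this.action = action;
        this.minAgeDays = minAgeDays;
        this.live = live;
        this.storageClass = storageClass;
    }

    /** The rule applies only when every condition that is set matches the object. */
    boolean appliesTo(int objectAgeDays, boolean objectIsLive, String objectStorageClass) {
        return (minAgeDays == null || objectAgeDays >= minAgeDays)
            && (live == null || live == objectIsLive)
            && (storageClass == null || storageClass.equals(objectStorageClass));
    }

    public static void main(String[] args) {
        // Assumed example rule: downgrade live Nearline objects once they are 30 days old.
        LifecycleRuleSketch downgrade =
            new LifecycleRuleSketch(Action.SET_STORAGE_CLASS, 30, true, "NEARLINE");
        System.out.println(downgrade.appliesTo(45, true, "NEARLINE")); // old enough
        System.out.println(downgrade.appliesTo(10, true, "NEARLINE")); // too young
    }
}
```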

Google Cloud Storage

Pricing
● Storage
● Data retrieval (Nearline, Coldline)
● Network
● Operations

Google Cloud Storage

Quickstart

https://cloud.google.com/storage/docs/quickstart-console
https://cloud.google.com/storage/docs/quickstart-gsutil

Google BigQuery

A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics

enterprise data warehouse

fast & large-scale

fully-managed

economical

Google BigQuery

Dremel

Google BigQuery

Structure

[Diagram labels: SQL Query, Petabit Network, BigQuery, Storage, Compute, Streaming Ingest, Fast Batch Load]

Google BigQuery

Columnar-storage

[Figure: a table with columns c1 to c5 and rows labelled 20160101 to 20160105; size labels: 60 GB, 125 GB, 80 GB, 45 GB, 99 GB]
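Because storage is columnar, a query only has to read the columns it references. A minimal sketch of that arithmetic follows. The assignment of the five size labels to c1..c5 is an assumption made purely for illustration (the extracted slide does not preserve which size belongs to which column), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnScanSketch {
    /** Sums the stored size of only the columns a query references. */
    static long scannedGb(Map<String, Long> columnSizeGb, List<String> referencedColumns) {
        long total = 0;
        for (String column : referencedColumns) {
            Long size = columnSizeGb.get(column);
            if (size == null) {
                throw new IllegalArgumentException("unknown column: " + column);
            }
            total += size;
        }
        return total;
    }

    /** Illustrative table; the column-to-size mapping is an assumption. */
    static Map<String, Long> exampleTable() {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("c1", 60L);
        sizes.put("c2", 125L);
        sizes.put("c3", 80L);
        sizes.put("c4", 45L);
        sizes.put("c5", 99L);
        return sizes;
    }

    public static void main(String[] args) {
        Map<String, Long> table = exampleTable();
        // SELECT c1, c4 ... reads two columns; SELECT * reads all five.
        System.out.println(scannedGb(table, Arrays.asList("c1", "c4")));
        System.out.println(scannedGb(table, Arrays.asList("c1", "c2", "c3", "c4", "c5")));
    }
}
```

Under these assumed sizes, selecting only c1 and c4 reads 105 GB, while selecting every column reads 409 GB.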

Google BigQuery

(Almost) append-only

● Data Manipulation Language: with a lot of constraints
  ○ No required field
  ○ Empty streaming buffer
  ○ Partitioned tables are not supported
  ○ No multi-statement transaction
  ○ Limited concurrency
● Use as an append-only database when possible

A BRIEF INTRODUCTION TO BIGQUERY

Structure / Dataset

PROJECT
DATASETS

● Contain a collection of tables and views
● Access control is applied to all tables/views in the dataset
● ACLs for Readers, Writers and Owners
● Access can be granted to datasets for users who are not members of the project

A BRIEF INTRODUCTION TO BIGQUERY

Structure / Table

PROJECT
DATASETS
TABLES

● Collection of columns and rows
● Data stored in managed storage
● Have a schema: describes strongly-typed columns of values
● Views are supported: virtual tables defined by a SQL query

A BRIEF INTRODUCTION TO BIGQUERY

Structure / Job

PROJECT
DATASETS
TABLES
JOBS

● Used to start all potentially long-running actions
● Examples: queries, importing / exporting data, copying data
● Can be cancelled

A BRIEF INTRODUCTION TO BIGQUERY

Schema - Types

● INT64, FLOAT64, STRING, BOOL, BYTES
● DATE, DATETIME, TIME, TIMESTAMP
● ARRAY: An ARRAY is an ordered list of zero or more elements of non-ARRAY values
● STRUCT: Container of ordered fields each with a type (required) and field name (optional).

A BRIEF INTRODUCTION TO BIGQUERY

Query results

TEMPORARY TABLES
● Used by caching
● Free storage
● Limited lifetime

USER-DEFINED TABLES
● permanent
● billed

A BRIEF INTRODUCTION TO BIGQUERY

Pricing

STORAGE
● billed per GB per month
● prorated per MB, per second
● discount after 90 days
● 10 GB per month is free

QUERIES
● billed by the amount of data processed by the query
● first 1 TB per month is free
● cached results are free
● queries that return an error are free

STREAMING INSERT
● insert row by row via the REST API
● billed per GB

A BRIEF INTRODUCTION TO BIGQUERY

Interfaces

● Web UI
● CLI
● RESTful API

A BRIEF INTRODUCTION TO BIGQUERY

BQ basic exercises

https://cloud.google.com/bigquery/quickstart-web-ui
goo.gl/jxU7a5

A BRIEF INTRODUCTION TO BIGQUERY

BQ as ETL tool

Source: daily snapshots of the source table as CSV
Target: dimension table with Type-2 history in BQ

A BRIEF INTRODUCTION TO BIGQUERY

BQ as ETL tool - Source

Schema
● STORENO: unique id of the store
● STORENAME: name of the store
● CHAIN: name of the chain the store belongs to. Can be null.
● STORETYPE: type of the store, INTERNAL or EXTERNAL. Only INTERNAL stores should be imported into BQ.
● BATCHDATE: the date when the snapshot was created

Location: https://console.cloud.google.com/storage/browser/bdf-bigquery-demo/storedata/

Separator: ‘;’
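The import rule above (semicolon-separated rows, only INTERNAL stores kept) can be sketched in a few lines. This is a stand-alone illustration, not part of the talk's solution: the class name and sample rows are hypothetical, and the column order is assumed to be the order in which the schema is listed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StoreSnapshotFilter {
    /** Keeps only rows whose STORETYPE (assumed to be the 4th field) is INTERNAL. */
    static List<String[]> internalStores(List<String> lines) {
        List<String[]> kept = new ArrayList<>();
        for (String line : lines) {
            // Limit -1 keeps empty fields, so a null CHAIN stays an empty string.
            String[] fields = line.split(";", -1);
            if (fields.length == 5 && "INTERNAL".equals(fields[3])) {
                kept.add(fields);
            }
        }
        return kept;
    }

    /** Hypothetical sample rows: STORENO;STORENAME;CHAIN;STORETYPE;BATCHDATE */
    static List<String> sampleLines() {
        return Arrays.asList(
            "1;Alpha;ChainA;INTERNAL;2017-06-01",
            "2;Beta;;EXTERNAL;2017-06-01",
            "3;Gamma;;INTERNAL;2017-06-01");
    }

    public static void main(String[] args) {
        for (String[] row : internalStores(sampleLines())) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```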

A BRIEF INTRODUCTION TO BIGQUERY

BQ as ETL tool - Target

BQ - Schema
● code: unique id of the store
● name: name of the store
● chain: name of the chain the store belongs to. Can be null.
● valid_from: Type-2 history
● valid_to: Type-2 history

A BRIEF INTRODUCTION TO BIGQUERY

BQ as ETL tool - Solution

Data import:

    bq load --autodetect --field_delimiter=';' --replace {project_id}:bdf_demo.store_raw gs://bdf-bigquery-demo/storedata/*

Query for data transformation:
https://bigquery.cloud.google.com/savedquery/862243936433:2283d8e8c4e942e1ae5ed8f7ed3d1cbd

View for proper Type-2 history:

    SELECT * EXCEPT(deleted)
    FROM (
      SELECT *,
        LEAD(valid_from) OVER (PARTITION BY code ORDER BY valid_from) AS valid_to
      FROM `{project_id}.bdf_demo.store`
    )
    WHERE deleted = FALSE
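To make the view's logic concrete, the sketch below emulates LEAD(valid_from) OVER (PARTITION BY code ORDER BY valid_from) in plain Java and then drops the rows flagged as deleted. It is only an illustration of the window function: the Row class and the sample rows are hypothetical, and the dates are assumed to be ISO strings so that string order equals date order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Type2HistorySketch {
    static final class Row {
        final String code;
        final String validFrom; // ISO date, so string order equals date order
        final boolean deleted;
        final String validTo;   // null for the latest version of a code

        Row(String code, String validFrom, boolean deleted, String validTo) {
            this.code = code;
            this.validFrom = validFrom;
            this.deleted = deleted;
            this.validTo = validTo;
        }
    }

    /** valid_to of a row is the valid_from of the next row with the same code. */
    static List<Row> withValidTo(List<Row> rows) {
        Map<String, List<Row>> byCode = new TreeMap<>();
        for (Row r : rows) {
            byCode.computeIfAbsent(r.code, k -> new ArrayList<>()).add(r);
        }
        List<Row> result = new ArrayList<>();
        for (List<Row> versions : byCode.values()) {
            versions.sort(Comparator.comparing((Row r) -> r.validFrom));
            for (int i = 0; i < versions.size(); i++) {
                Row r = versions.get(i);
                String next = i + 1 < versions.size() ? versions.get(i + 1).validFrom : null;
                if (!r.deleted) { // WHERE deleted = FALSE, applied after LEAD
                    result.add(new Row(r.code, r.validFrom, false, next));
                }
            }
        }
        return result;
    }

    /** Hypothetical input: store S2 is deleted on 2017-06-02. */
    static List<Row> exampleRows() {
        return Arrays.asList(
            new Row("S1", "2017-06-01", false, null),
            new Row("S1", "2017-06-03", false, null),
            new Row("S2", "2017-06-01", false, null),
            new Row("S2", "2017-06-02", true, null));
    }

    public static void main(String[] args) {
        for (Row r : withValidTo(exampleRows())) {
            System.out.println(r.code + " " + r.validFrom + " -> " + r.validTo);
        }
    }
}
```

Keeping the deleted marker inside the window and filtering afterwards is what lets the last real version of a deleted store receive a closing valid_to.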

PROGRAMMING MODEL RUNNERS

APACHE BEAM MODEL

Processing- vs event-time

Source: The world beyond batch: Streaming 102 (Tyler Akidau)

APACHE BEAM MODEL

Watermark

A watermark with a value of time X makes the statement: “all input data with event times less than X have been observed.” As such, watermarks act as a metric of progress when observing an unbounded data source with no known end.
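Read literally, the statement gives two simple checks, sketched below in plain Java. This is an illustration of the definition only, not Beam's implementation; the class and method names are hypothetical and the timestamps are made up.

```java
import java.time.Instant;

final class WatermarkSketch {
    /**
     * A window ending at windowEnd (exclusive) is complete once the watermark
     * has reached its end: all event times below the watermark were observed.
     */
    static boolean windowComplete(Instant watermark, Instant windowEnd) {
        return !watermark.isBefore(windowEnd);
    }

    /** An element contradicts the watermark's claim if its event time is below it. */
    static boolean arrivesLate(Instant eventTime, Instant watermark) {
        return eventTime.isBefore(watermark);
    }

    public static void main(String[] args) {
        Instant watermark = Instant.parse("2017-06-12T00:01:00Z");
        // The window ending at 00:01:00 is complete, the one ending at 00:02:00 is not.
        System.out.println(windowComplete(watermark, Instant.parse("2017-06-12T00:01:00Z")));
        System.out.println(windowComplete(watermark, Instant.parse("2017-06-12T00:02:00Z")));
        // An element stamped 00:00:30 that shows up now is late.
        System.out.println(arrivesLate(Instant.parse("2017-06-12T00:00:30Z"), watermark));
    }
}
```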

APACHE BEAM MODEL

Watermark

Source: The world beyond batch: Streaming 102 (Tyler Akidau)

APACHE BEAM MODEL

Pipeline structure

Source → PCollection → PTransform → PCollection → Sink

APACHE BEAM MODEL

Concepts

● What results are being computed?
● Where in event time are they being computed?
● When in processing time are they materialized?
● How do earlier results relate to later refinements?

APACHE BEAM MODEL

What are you computing?

PTransform
● Element-wise
● Aggregation
● Composite

APACHE BEAM MODEL

What are you computing?

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<Integer> salesRecords = ...;
    // Sum all elements of the collection into a single total.
    PCollection<Integer> totalSales = salesRecords
        .apply(Sum.integersGlobally());

APACHE BEAM MODEL

What are you computing?

Source: The world beyond batch: Streaming 102 (Tyler Akidau)

APACHE BEAM MODEL

Where in event time?

● Fixed
● Sliding
● Per-Session

[Figure: fixed, sliding and per-session windows drawn for Key 1, Key 2 and Key 3]

APACHE BEAM MODEL

Where in event time?

    // Window, FixedWindows: org.apache.beam.sdk.transforms.windowing; Duration: org.joda.time
    PCollection<Integer> salesRecords = ...;
    PCollection<Integer> totalSales = salesRecords
        // Assign every element to a two-minute fixed window in event time.
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(2))))
        // withoutDefaults() is needed when combining globally over non-global windows.
        .apply(Sum.integersGlobally().withoutDefaults());
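How a timestamp maps to fixed, sliding and session windows can be sketched without the SDK. The code below is a plain-Java illustration of the arithmetic, not Beam's implementation: the class and method names are hypothetical, timestamps are plain non-negative numbers in an arbitrary unit, and windows are treated as half-open intervals [start, end).

```java
import java.util.ArrayList;
import java.util.List;

final class WindowAssignmentSketch {
    /** Start of the single fixed window of length size that contains ts. */
    static long fixedWindowStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** Starts of every sliding window (length size, advancing by period) containing ts. */
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % period);
        for (long start = lastStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    /**
     * Session windows for one key: each element opens [ts, ts + gap) and
     * overlapping windows are merged. Input must be sorted ascending.
     */
    static List<long[]> sessions(long[] sortedTs, long gap) {
        List<long[]> out = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!out.isEmpty() && ts < out.get(out.size() - 1)[1]) {
                out.get(out.size() - 1)[1] = ts + gap; // extend the open session
            } else {
                out.add(new long[] {ts, ts + gap});    // start a new session
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(150, 120));
        System.out.println(slidingWindowStarts(150, 120, 60));
        for (long[] s : sessions(new long[] {0, 30, 200}, 60)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

An element belongs to exactly one fixed window, to size / period sliding windows, and to a session whose bounds depend on the neighbouring elements of the same key.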

APACHE BEAM MODEL

Where in event time?

Source: The world beyond batch: Streaming 102 (Tyler Akidau)

APACHE BEAM MODEL

When in processing time?

Triggers
● Time-based
● Data-driven
● Composite

APACHE BEAM MODEL

When in processing time?

    PCollection<Integer> salesRecords = ...;
    PCollection<Integer> totalSales = salesRecords
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Fire at the watermark, with early firings one minute after the
            // first element of a pane and a late firing for every late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1))))
        .apply(Sum.integersGlobally().withoutDefaults());

APACHE BEAM MODEL

When in processing time?

Source: The world beyond batch: Streaming 102 (Tyler Akidau)

APACHE BEAM MODEL

How do refinements relate?

Firing          Elements  Discarding  Accumulating  Accumulating & Retracting
Early           3, 4      7           7             7
Watermark       2, 6      8           15            15, -7
Late            3         3           18            18, -15
Total observed  18        18          40            18
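The three accumulation modes can be reproduced with a few lines of plain Java. The sketch below is an illustration of the table's arithmetic and not the Beam SDK; the class and method names are hypothetical, and the three panes are the early, watermark and late firings from the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AccumulationModesSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Discarding: each firing reports only the elements since the previous firing. */
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            out.add(sum(pane));
        }
        return out;
    }

    /** Accumulating: each firing reports the running total of the window. */
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(running);
        }
        return out;
    }

    /** Accumulating and retracting: the running total plus a retraction of the previous one. */
    static List<Integer> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (List<Integer> pane : panes) {
            int previous = running;
            running += sum(pane);
            out.add(running);
            if (!first) {
                out.add(-previous);
            }
            first = false;
        }
        return out;
    }

    /** Early, watermark and late firings as in the table. */
    static List<List<Integer>> examplePanes() {
        return Arrays.asList(Arrays.asList(3, 4), Arrays.asList(2, 6), Arrays.asList(3));
    }

    public static void main(String[] args) {
        List<List<Integer>> panes = examplePanes();
        System.out.println(discarding(panes) + " total " + sum(discarding(panes)));
        System.out.println(accumulating(panes) + " total " + sum(accumulating(panes)));
        System.out.println(accumulatingAndRetracting(panes)
            + " total " + sum(accumulatingAndRetracting(panes)));
    }
}
```

A consumer that simply adds up everything it receives gets the correct 18 with discarding or retracting output, but 40 with plain accumulation, which is why accumulating panes suit consumers that overwrite earlier results rather than sum them.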

APACHE BEAM MODEL

What Where When How

    PCollection<Integer> salesRecords = ...;
    PCollection<Integer> totalSales = salesRecords
        // Where: one-minute fixed windows in event time.
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
            // When: one minute of processing time after the first element of a pane.
            .triggering(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            // How: each firing reports only the new elements.
            .discardingFiredPanes())
        // What: an integer sum (the input is a PCollection<Integer>).
        .apply(Sum.integersGlobally().withoutDefaults());

APACHE BEAM MODEL

Live demo - Events

[Figure: ten events with values 5, 7, 3, 2, 6, 4, 1, 9, 7, 8 placed on time axes running from 23:59 to 00:03; the sums shown are 10, 27 and 15]

APACHE BEAM MODEL

Live demo - Pipeline 1

    // HumanIO, StickyNotesCoder and FlipChartIO are not Beam classes; they
    // appear to be the live demo's stand-ins for a source, a coder and a sink.
    pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .accumulatingFiredPanes())
        .apply(Sum.integersGlobally())
        .apply(FlipChartIO.write());

APACHE BEAM MODEL

Live demo - Pipeline 1 (the same pipeline is shown on every frame; only the time and the watermark change)

time      watermark
00:00:02  23:59:00
00:00:17  23:59:00
00:00:21  23:59:00
00:00:27  00:00:00
00:00:48  00:00:00
00:00:50  00:00:00
00:00:58  00:00:00
00:01:02  00:00:00
00:01:12  00:00:00
00:01:16  00:01:00
00:01:48  00:01:00
00:02:01  00:01:00
00:02:22  00:02:00

APACHE BEAM MODEL

Live demo - Pipeline 2

    // HumanIO, StickyNotesCoder and FlipChartIO are not Beam classes; they
    // appear to be the live demo's stand-ins for a source, a coder and a sink.
    pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .accumulatingFiredPanes())
        .apply(Sum.integersGlobally())
        .apply(FlipChartIO.write());

Page 66: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

time00:00:02

watermark23:59:00

Page 67: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:17

watermark23:59:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 68: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:21

watermark23:59:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 69: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:27

watermark00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 70: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:47

watermark00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 71: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:48

watermark00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 72: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:50

watermark00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 73: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:00:58

watermark00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

Page 74: GOOGLE CLOUD BIG DATA IN THE BIGQUERY, APACHE BEAM, … · 2017. 6. 29. · BIG DATA IN THE GOOGLE CLOUD BIGQUERY, APACHE BEAM, DATAFLOW 2107.06.12. Kassai Csaba - Lead Data Architect

time00:01:02

watermark00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of());

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))

.triggering(AfterWatermark.pastEndOfWindow())

.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardSeconds(30)))

.withLateFirings(AfterPane.elementCountAtLeast(1)))

.accumulatingFiredPanes())

.apply(Sum.integersGlobally());

.apply(FlipChartIO.write())

time 00:01:12

watermark 00:00:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());

time 00:01:16

watermark 00:01:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());

time 00:01:32

watermark 00:01:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());

time 00:01:48

watermark 00:01:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());

time 00:02:01

watermark 00:01:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());

time 00:02:18

watermark 00:01:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());

time 00:02:22

watermark 00:02:00

APACHE BEAM MODEL

Live demo - Pipeline 2

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());
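The demo frames show the watermark advancing in one-minute steps (00:00:00, 00:01:00, 00:02:00) while wall-clock time moves ahead of it. As a rough sketch of what that means for one-minute fixed windows, the plain-Java model below assigns an event time to its window and treats an element as late once the watermark has reached the end of that window. This is a simplification for illustration, assuming non-negative timestamps in seconds; the names are hypothetical and it does not use the Beam SDK.

```java
class FixedWindowSketch {
    static final long WINDOW_SECONDS = 60;

    /** Start of the one-minute window that contains the given event time. */
    static long windowStart(long eventTimeSeconds) {
        return eventTimeSeconds - (eventTimeSeconds % WINDOW_SECONDS);
    }

    /** Exclusive end of the window that contains the given event time. */
    static long windowEnd(long eventTimeSeconds) {
        return windowStart(eventTimeSeconds) + WINDOW_SECONDS;
    }

    /** True once the watermark has reached the end of the element's window. */
    static boolean isLate(long eventTimeSeconds, long watermarkSeconds) {
        return watermarkSeconds >= windowEnd(eventTimeSeconds);
    }

    public static void main(String[] args) {
        System.out.println(windowStart(75)); // prints 60
        System.out.println(isLate(45, 0));   // prints false
        System.out.println(isLate(45, 60));  // prints true
    }
}
```

In this model an element stamped 00:00:45 lands in an early or on-time pane while the watermark is at 00:00:00, and in a late pane once the watermark shows 00:01:00.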

APACHE BEAM MODEL

Live demo - Pipeline 3

pipeline.apply(HumanIO.read()).setCoder(StickyNotesCoder.of())
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .discardingFiredPanes())
    .apply(Sum.integersGlobally())
    .apply(FlipChartIO.write());
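Pipeline 3 differs from Pipeline 2 only in using discardingFiredPanes() instead of accumulatingFiredPanes(). The effect on the emitted sums can be sketched in plain Java: in accumulating mode each firing reports everything seen so far in the window, while in discarding mode each firing reports only what arrived since the previous firing. The input data and the class and method names are made up for illustration, and the sketch does not use the Beam SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PaneModes {
    /** Sum emitted at each firing when fired panes accumulate. */
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            for (int value : pane) {
                running += value;
            }
            out.add(running); // includes all earlier panes of the window
        }
        return out;
    }

    /** Sum emitted at each firing when fired panes are discarded. */
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            int sum = 0;
            for (int value : pane) {
                sum += value;
            }
            out.add(sum); // only the elements new to this pane
        }
        return out;
    }

    public static void main(String[] args) {
        // Three panes of one window: early, on-time, late (hypothetical values).
        List<List<Integer>> panes = Arrays.asList(
            Arrays.asList(3, 4), Arrays.asList(5), Arrays.asList(2));
        System.out.println(accumulating(panes)); // prints [7, 12, 14]
        System.out.println(discarding(panes));   // prints [7, 5, 2]
    }
}
```

A sink that overwrites the previous value per window suits accumulating mode. A sink that adds up whatever it receives suits discarding mode, since otherwise elements would be counted more than once.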

Events

Google Cloud Pub/Sub

● Messaging with a many-to-many topology

● Topic-subscription model

● No-ops

● At-least-once delivery

● REST API

● Scalable: 10,000 messages/sec by default
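At-least-once delivery means a subscriber may see the same message more than once, so processing should be idempotent. A minimal sketch of one common approach, assuming every message carries a unique ID: remember the IDs already handled and skip repeats. The class below is hypothetical and does not use the Pub/Sub client library. The in-memory set is for illustration only; a real subscriber would need bounded or persistent storage.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupingSubscriber {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    /**
     * Handles one delivery. Returns true if the payload was processed,
     * false if the message ID had been seen before and the delivery was skipped.
     */
    boolean onMessage(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return false; // redelivery of a message that was already handled
        }
        processed.add(payload);
        return true;
    }

    /** Payloads processed so far, in arrival order. */
    List<String> processed() {
        return processed;
    }

    public static void main(String[] args) {
        DedupingSubscriber subscriber = new DedupingSubscriber();
        subscriber.onMessage("m1", "sale: 10");
        subscriber.onMessage("m2", "sale: 25");
        subscriber.onMessage("m1", "sale: 10"); // duplicate delivery
        System.out.println(subscriber.processed()); // prints [sale: 10, sale: 25]
    }
}
```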

Google Cloud Pub/Sub

Intro

https://cloud.google.com/pubsub/docs/quickstart-console

https://cloud.google.com/pubsub/docs/quickstart-cli

CLOUD DATAFLOW

The Cloud Dataflow runner

● Fully managed, no-ops execution environment

● Seamless integration with other GCP services

CLOUD DATAFLOW

Fully managed

● Autoscaling

● Dynamic work rebalancing

● Graph optimization

● Worker lifecycle management

CLOUD DATAFLOW

Monitoring interface

CLOUD DATAFLOW

Logging

CLOUD DATAFLOW

Codelab

goo.gl/k0qH7a

USE CASE

Retail BI system

Processing and storing sales transactions in real time to support:

● Performance metrics

● Demand prediction

● Logistics optimization

● Collecting and selling insights

USE CASE

Architecture