Apache Beam @ GCPUG.TW Flink.TW 20161006


Apache Beam in Data Pipeline

Randy Huang 2016/10/06

Who am I

• Data Architect @ VMFive

• Fluentd/Embulk fans

Overview

• Define Data Pipeline

• Architecture

• How to write Beam pipelines

• Demo

Data Pipeline

Input → Algorithm → Output

Why Apache Beam?

The data pipeline world is chaos.

Goal

• Provide an abstraction layer between the data-processing code and the execution runtime.

• Batch processing and streaming jobs in one world.

• The Beam SDK opens the door to write once, run anywhere* (see the sketch below).

* on-premises and non-Google clouds
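A minimal sketch of what that abstraction looks like in the Beam Java SDK: the pipeline code never names a runner, and the runner is chosen from the launch arguments. (Method names here follow the stable Beam Java API; the 2016 incubating releases spelled a few of them differently.)

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class RunAnywhere {
    public static void main(String[] args) {
      // The runner (Dataflow, Flink, Spark, ...) comes from --runner=...,
      // so nothing in the pipeline code depends on the execution engine.
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
      Pipeline p = Pipeline.create(options);
      // ... apply the same transforms regardless of runner ...
      p.run();
    }
  }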

Supported Runners

• Google Cloud Dataflow (blocking/non-blocking execution)

• Apache Flink 1.1.2

• Apache Spark 1.6.2 (Hadoop 2.2.0, Kafka 0.8.2.1)
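The same jar can then target any of these runners with a one-flag change. A usage sketch (the jar name and flag values are placeholders, but the --runner option itself is standard Beam):

  java -jar pipeline.jar --runner=DirectRunner
  java -jar pipeline.jar --runner=FlinkRunner --flinkMaster=localhost:6123
  java -jar pipeline.jar --runner=SparkRunner
  java -jar pipeline.jar --runner=DataflowRunner --project=my-gcp-project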

API, model, and engine

Architecture

• Pipelines

• Translators

• Runners

Programming tips / Flink

• Use the Flink DataStream API in Java and Scala

• Use the Beam API directly in Java (and soon Python) with the Flink runner
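For the second option, a minimal Java sketch of pointing Beam at the Flink runner (FlinkRunner and FlinkPipelineOptions come from the Beam Flink runner artifact; the streaming flag is shown only to illustrate that one pipeline covers both modes):

  import org.apache.beam.runners.flink.FlinkPipelineOptions;
  import org.apache.beam.runners.flink.FlinkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
  options.setRunner(FlinkRunner.class);   // execute on Flink instead of Dataflow/Spark
  options.setStreaming(true);             // run as a Flink streaming job
  Pipeline p = Pipeline.create(options);
  // ... apply transforms as usual; the Flink translator does the rest ...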

SDK

• Four parts:

• Pipeline: streaming & batch processing

• PCollection

• Transform

• I/O: sources & sinks
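All four parts in one small word-count sketch (Beam Java SDK; the file paths are placeholders, and method names follow the stable releases):

  import java.util.Arrays;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Count;
  import org.apache.beam.sdk.transforms.FlatMapElements;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;

  // Pipeline: the container for the whole job, batch or streaming.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

  // I/O source -> PCollection: an immutable, distributed dataset.
  PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));

  // Transforms: split lines into words, then count occurrences.
  PCollection<KV<String, Long>> counts = lines
      .apply(FlatMapElements.into(TypeDescriptors.strings())
          .via((String line) -> Arrays.asList(line.split("\\s+"))))
      .apply(Count.perElement());

  // I/O sink: write the results back out.
  counts
      .apply(MapElements.into(TypeDescriptors.strings())
          .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
      .apply(TextIO.write().to("counts"));

  p.run().waitUntilFinish();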

For Flink users

• We encourage users to use either the Beam or the Flink API to implement their Flink jobs for stream data processing.

• But the native Flink API offers:

• backwards-compatible API

• built-in libraries (e.g., CEP and upcoming SQL)

• key-value state (with the ability to query that state in the future; see the sketch below)

http://data-artisans.com/why-apache-beam/
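As an example of that last point, keyed state in the native Flink DataStream API looks roughly like this (a sketch; the stream type and names are made up):

  import org.apache.flink.api.common.functions.RichFlatMapFunction;
  import org.apache.flink.api.common.state.ValueState;
  import org.apache.flink.api.common.state.ValueStateDescriptor;
  import org.apache.flink.configuration.Configuration;
  import org.apache.flink.util.Collector;

  // Emits a running count per key, kept in Flink-managed key-value state.
  public class CountPerKey extends RichFlatMapFunction<String, Long> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration conf) {
      count = getRuntimeContext().getState(
          new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String event, Collector<Long> out) throws Exception {
      Long current = count.value();          // null on the first event for a key
      long next = (current == null ? 0L : current) + 1;
      count.update(next);
      out.collect(next);
    }
  }

  // usage: events.keyBy(e -> e).flatMap(new CountPerKey())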

Demo

• GDELT project

• EventCount by Location

Pipeline
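A hedged sketch of the demo pipeline in the Beam Java SDK (the input path and the column that holds the location are placeholders; GDELT event records are tab-separated):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Count;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.TypeDescriptors;

  public class GdeltEventCount {
    // Placeholder: index of the location field in a GDELT event record.
    static final int LOCATION_INDEX = 0;

    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
      p.apply(TextIO.read().from("gdelt-events.csv"))
       .apply(MapElements.into(TypeDescriptors.strings())
           .via((String row) -> row.split("\\t")[LOCATION_INDEX]))   // extract location
       .apply(Count.perElement())                                    // events per location
       .apply(MapElements.into(TypeDescriptors.strings())
           .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
       .apply(TextIO.write().to("event-counts"));
      p.run().waitUntilFinish();
    }
  }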

Recap

• Write the data pipeline once, then choose your runner.

Next…

• New runners and SDKs (the Python SDK is still in development)

• DSL

Other things

• BigQuery now has DML support! https://goo.gl/lcZQVZ

• Data Studio beta is now available in Taiwan

• Embulk

• Fluentd v0.14.6 - 2016/09/07

• secure forward

• remember to set up nginx
