28
Building Scalable Big Data Pipelines Christian Gügi, Solution Architect 19.09.2013 NOSQL SEARCH ROADSHOW ZURICH

Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

Building Scalable Big Data Pipelines

Christian Gügi, Solution Architect

19.09.2013

NOSQL SEARCH ROADSHOW ZURICH

Page 2: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

AGENDA

 Opportunities & Challenges

 Integrating Hadoop

 Lambda Architecture

 Lambda in Practice

 Recommendations

Page 3: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

ABOUT ME

 Solution Architect @ YMC

 Founder and organizer Swiss Big Data User Group

  http://www.bigdata-usergroup.ch/

 Contact

[email protected]

  http://about.me/cguegi

  @chrisgugi

Page 4: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

ABOUT YMC

 Founded in 2001

 Based in Kreuzlingen, Switzerland

 Big Data Analytics, Web Solutions and

Mobile Applications

 24 experts

  Consulting, creation, engineering

Page 5: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

OPPORTUNITIES &

Page 6: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

A.  New sources and types from inside & outside organisations   “Internet of things”, sensors, RFID, intelligent devices, etc.

  Unstructured information – documents, web logs, email, social media, etc.

  Trusted 3rd party sources – industry provider & aggregators, governments “Open Data”, weather, etc.

B.  Technology innovations to exploit new world of data   Low cost storage and process power (cloud, on-premise &

hybrid)

  New software patterns to handle speed & volume, structured and unstructured (In-memory computation, Hadoop, Mapreduce, etc.)

  Revolution in user experience, analytics, recommendations

BIG DATA – WHAT IS THE BIG DEAL?

Page 7: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

BIG DATA – CHALLENGES

• Align business strategy

• Data Management

• Privacy protection

• Lack of skilled and experienced people

• Volume

• Velocity

• Variety

• Veracity

Character

of data

Overwhelming landscape & integration

Available talent

Organisational issues

Page 8: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

INTEGRATING

Page 9: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

TYPICAL RDBMS SZENARIO

Data

So

urc

es

Data

Syste

ms

Ap

ps

DWH RDBMS

RDBMS NFS Others

BI Web Mobile

ETL

Page 10: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

BIG DATA SZENARIO

Data

So

urc

es

Data

Syste

ms

Ap

ps

DWH RDBMS

RDBMS NFS Logs

BI Web Mobile

Social

Media Sensors

Hadoop

1) Recommendations, etc.

1)

Page 11: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

HADOOP ECOSYSTEM

Page 12: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

LAMBDA

Page 13: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

ARCHITECTURE

 Credits Nathan Marz

 Former Engineer at Twitter

 Storm, Cascalog, ElephantDB

LAMBDA

http://www.manning.com/marz/

Page 14: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

DESIGN PRINCIPLES

Lambda Architecture

 Human fault-tolerance

 Data immutability

 Re-computation

Page 15: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

HUMAN FAULT-TOLERANCE

Lambda Architecture

 Design for human error

 Bugs in code

  Accidental data loss

 Data corruption

 Protect good data, so you can always fix

what went wrong

Page 16: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

DATA IMMUTABILIY

Lambda Architecture

 Store data in it’s rawest form

 Create and read but no update

 No data can be lost

  To fix the system just delete bad data

 Can always revert to a true state

Page 17: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

DATA IMMUTABILIY

Lambda Architecture

Name Location Time

Alice Zurich 2009/03/29

Bob Lucerne 2012/04/12

Tom Bern 2010/04/09

Name Location Time

Alice Zurich 2009/03/29

Bob Lucerne 2012/04/12

Tom Bern 2010/04/09

Alice Basel 2013/08/20

Name Location

Alice Zurich

Bob Lucerne

Tom Bern

Name Location

Alice Basel

Bob Lucerne

Tom Bern

Capturing change traditionally (mutability)

Capturing change (immutability)

Page 18: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

RE-COMPUTATION

Lambda Architecture

 Always able to re-compute from historical

data

 Basis for all data systems

  query = function(all data)

All Data Pre-computed

views Query

Page 19: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

LAYERS

Lambda Architecture

http://www.ymc.ch/en/lambda-architecture-part-1

Page 20: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

Lambda in Practice

Page 21: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

ONLINE MARKETING

 Tracking and analytics solution

 Improve customer targeting and

segmentation

 Various reports

 Real-time not required

Page 22: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

OVERVIEW

HBase

FTP

HDFS Ad

Serv

er

Flume

log

HDFS

Campaign

Database Sqoop

csv

csv

Up- &

Download

Hive

fs -put

Ag

gre

gate

d

Data

Web

Pig Impala

DWH

BI apps

Oozie ZooKeeper Cloudera

Manager

Page 23: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

DATA PIPELINE

FTP

HDFS

Ad

Serv

er

Flume

log

HDFS

Campaign

Database Sqoop

csv

csv

fs -put

M/R Avro

Avro

Avro

Extracting Transformation

M/R

M/R

Loading

Tracking

Profiles

Bulk Importer

DWH

Page 24: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

ADVANTAGES

 Extensible – easily add speed layer later on

 Complements existing DWH/BI system

 ETL phases are decoupled

 Reliable   Infrastructure

  Each step can be replayed

 Scalable   Storage

  Processing

 Highly available

 Ad-hoc analysis right from the beginning

Page 25: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

RECOMMENDATIONS

Page 26: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

RECOMMENDATIONS

 Not a fixed, one-size-fits-all approach

  Adopt to your needs/requirements

 Hadoop complements existing systems

 How real-time do I need to be?

  Immutability and pre-computation are just good

ideas!

  Store information in rawest format possible

  Use a serialization framework (Avro, Thrift, Protocol

Buffers)

Page 27: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

THANK YOU!

Page 28: Building Scalable Big Data Pipelines - nosqlroadshow.comnosqlroadshow.com/dl/basho-roadshow-zurich-2013/... · Building Scalable Big Data Pipelines Christian Gügi, Solution Architect

YMC AG

Sonnenstrasse 4

CH-8280 Kreuzlingen

Switzerland

Photo Credits:

Slide 05: Success opportunity achieve by Stephen McCulloch

Slide 08: Matrix by Gamaliel Espinoza Macedo.

Slide 12: Layers by Katelyn Leblanc

Slide 20: Mining For Information by JD Hancock

Slide 27: Warning Question by longzijun

@chrisgugi

CONTACT US [email protected]

Tel. +41 (0)71 508 24 76

www.ymc.ch