ETL All The Things With Ruby

DESCRIPTION

Slides from my talk about working on an ETL project through three iterations over the course of a year. Check out https://github.com/nevern02/rodimus for the Rodimus project. For some further background on Rodimus, check out http://www.blrice.net/blog/2014/06/03/etl-with-ruby-and-rodimus/

Page 1: ETL All The Things with Ruby

ETL All The Things

With Ruby

Page 2: ETL All The Things with Ruby

BIG DATA

Page 3: ETL All The Things with Ruby

What is ETL?

➢ Extract, Transform, Load

➢ Move data from Point A to Point B

➢ Maybe do something in between

➢ Commonly used in data warehousing
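
The three stages can be sketched in plain Ruby. This is only an illustration, not code from the talk: the inline CSV text, the field names, and the JSON destination are all assumptions standing in for "Point A" and "Point B".

```ruby
require "json"

# Hypothetical source data standing in for "Point A".
SOURCE_TEXT = <<~TEXT
  name,signup_date
  alice,2014-06-03
  bob,2014-06-04
TEXT

# Extract: parse the raw text into an array of hashes keyed by header.
def extract(text)
  header, *lines = text.lines.map(&:chomp)
  keys = header.split(",")
  lines.map { |line| keys.zip(line.split(",")).to_h }
end

# Transform: the optional "something in between" (here, capitalize names).
def transform(rows)
  rows.map { |row| row.merge("name" => row["name"].capitalize) }
end

# Load: write to the destination; a JSON string stands in for "Point B".
def load_rows(rows)
  JSON.generate(rows)
end

result = load_rows(transform(extract(SOURCE_TEXT)))
puts result
```

Keeping each stage as its own method makes it possible to swap the source or destination without touching the transform.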

Page 4: ETL All The Things with Ruby

Real World Example: Model Events

➢ Record history of changes to AR models

➢ Record only the fields that changed

➢ Storage in Mongo (no schema)

➢ Analysts use only Windows SQL clients
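
Recording only the changed fields can be sketched with plain hashes. The `changed_fields` helper, the `Order` model name, and the event layout below are hypothetical; in a Rails app, ActiveRecord's dirty tracking (`changes`) reports a similar old/new pair per attribute.

```ruby
# Return only the fields whose values differ, as { field => [old, new] }.
def changed_fields(before, after)
  (before.keys | after.keys).each_with_object({}) do |key, diff|
    old_value = before[key]
    new_value = after[key]
    diff[key] = [old_value, new_value] unless old_value == new_value
  end
end

# Hypothetical snapshots of a model's attributes before and after a save.
before = { "id" => 1, "status" => "pending", "price" => 10 }
after  = { "id" => 1, "status" => "shipped", "price" => 10 }

# A schemaless event document: different events carry different fields.
event = {
  "model"    => "Order",
  "model_id" => 1,
  "changes"  => changed_fields(before, after)
}

puts event.inspect
```

Because each event holds only the fields that changed, the documents have no fixed shape, which is why a schemaless store fits and why the SQL side needs its schema worked out later.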

Page 5: ETL All The Things with Ruby

Model Events: Considerations

➢ Change data capture (CDC)

➢ Dynamic schema determination

○ Picking a column type

○ Altering existing column types

○ Trying not to take f***ing forever
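
One way to approach dynamic schema determination is to infer a column type per field and widen it when a later value does not fit. The sketch below is an assumption about how that could look, not the talk's implementation; the three-type ladder and all names are hypothetical.

```ruby
# Hypothetical type ladder, ordered from narrowest to widest.
TYPE_ORDER = %w[integer float text].freeze

# Picking a column type for a single value.
def column_type_for(value)
  case value
  when Integer then "integer"
  when Float   then "float"
  else "text"
  end
end

# Of two types, keep whichever sits higher on the ladder.
def widest(a, b)
  [a, b].max_by { |type| TYPE_ORDER.index(type) }
end

# Single pass over the documents, starting from the existing schema.
def infer_schema(documents, schema = {})
  documents.each_with_object(schema.dup) do |doc, result|
    doc.each do |field, value|
      next if value.nil?
      candidate = column_type_for(value)
      result[field] = result.key?(field) ? widest(result[field], candidate) : candidate
    end
  end
end

existing = { "price" => "integer" }
docs = [{ "price" => 10, "sku" => "A1" }, { "price" => 10.5 }, { "qty" => 3 }]

schema      = infer_schema(docs, existing)
new_columns = schema.keys - existing.keys
altered     = schema.select { |field, type| existing.key?(field) && existing[field] != type }

puts "add: #{new_columns.inspect}, alter: #{altered.inspect}"
```

Computing the full set of additions and alterations for a batch first means the destination table is altered once per batch rather than once per row.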

Page 7: ETL All The Things with Ruby

Model Events: Schema (Before / After)

Page 8: ETL All The Things with Ruby

First Iteration: Ruby

Page 9: ETL All The Things with Ruby

ETL Tools

➢ Ad hoc scripts

➢ Code generators

➢ Engines

➢ Model Driven Architecture (MDA)

Page 10: ETL All The Things with Ruby

Source: Pentaho Kettle Solutions (2010)

Page 11: ETL All The Things with Ruby

ETL Example: Cleaning Data

Before

City          State
Baltimore     Maryland
Harrisburg    Pa
Philadelphia  Penn

After (ETL)

City          State
Baltimore     MD
Harrisburg    PA
Philadelphia  PA
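
The cleaning in this example can be sketched as a lookup-table transform. The mapping below covers only the variants shown on the slide; the constant and method names are hypothetical.

```ruby
# Known spellings mapped to the two-letter abbreviation (illustrative only).
STATE_ABBREVIATIONS = {
  "maryland"     => "MD",
  "md"           => "MD",
  "pennsylvania" => "PA",
  "penn"         => "PA",
  "pa"           => "PA"
}.freeze

# Normalize case, whitespace, and periods; leave unknown values untouched.
def normalize_state(raw)
  key = raw.to_s.strip.downcase.delete(".")
  STATE_ABBREVIATIONS.fetch(key, raw)
end

rows = [
  { city: "Baltimore",    state: "Maryland" },
  { city: "Harrisburg",   state: "Pa" },
  { city: "Philadelphia", state: "Penn" }
]

cleaned = rows.map { |row| row.merge(state: normalize_state(row[:state])) }
cleaned.each { |row| puts "#{row[:city]} #{row[:state]}" }
```

Passing unknown values through unchanged keeps bad input visible downstream instead of silently dropping it.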

Page 12: ETL All The Things with Ruby

Second Iteration: Kettle

➢ Java-based ETL engine

➢ Open source

➢ GUI development environment

➢ Optimized for non-programmers

Page 17: ETL All The Things with Ruby

Kettle/ETL Resources

➢ Kettle

○ http://community.pentaho.com

➢ Ruby plugin for Kettle

○ https://github.com/nevern02/Ruby-Scripting-for-Kettle

➢ Recommended book

○ Pentaho Kettle Solutions: Building Open Source ETL Solutions With Pentaho Data Integration (2010)

■ Matt Casters, Roland Bouman, Jos van Dongen

Page 18: ETL All The Things with Ruby

Rodimus

➢ https://github.com/nevern02/rodimus

➢ Ruby ETL framework

➢ Lightweight, minimal dependencies

➢ Extensible

➢ Custom solutions built on top

Page 19: ETL All The Things with Ruby

http://blitz-wing.deviantart.com/art/Rodimus-Prime-183670417

Page 20: ETL All The Things with Ruby

Rodimus

[Diagram: Source → steps 1, 2, 3 → Destination, together forming a Transformation]

Page 21: ETL All The Things with Ruby

Rodimus Example: Rails Logs

1. Read input

2. Transform data

3. Write output
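
The code from the original slides is not preserved here, so the following is a plain-Ruby sketch of the same three steps and deliberately does not use the Rodimus API. The sample log lines follow the usual Rails request log layout; the regular expression, method names, and comma-separated output are assumptions.

```ruby
require "stringio"

# Hypothetical sample of Rails log output.
LOG = <<~LOG
  Started GET "/products" for 127.0.0.1 at 2014-06-03 10:00:00 -0400
  Processing by ProductsController#index as HTML
  Completed 200 OK in 35ms (Views: 20.0ms | ActiveRecord: 5.0ms)
  Started POST "/orders" for 10.0.0.5 at 2014-06-03 10:00:02 -0400
LOG

REQUEST_LINE = /\AStarted (?<verb>[A-Z]+) "(?<path>[^"]+)" for (?<ip>\S+) at (?<time>.+)\z/

# 1. Read input: one row per log line.
def read_input(io)
  io.each_line.map(&:chomp)
end

# 2. Transform data: keep only "Started" lines and pull out their fields.
def transform(lines)
  lines.filter_map do |line|
    match = REQUEST_LINE.match(line)
    match && [match[:verb], match[:path], match[:ip], match[:time]]
  end
end

# 3. Write output: one comma-separated line per request.
def write_output(rows, io)
  rows.each { |fields| io.puts(fields.join(",")) }
end

output = StringIO.new
write_output(transform(read_input(StringIO.new(LOG))), output)
puts output.string
```

Reading from and writing to IO objects means the same methods work with files, sockets, or in-memory strings.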

Page 22: ETL All The Things with Ruby

Step 1: Read Input

Page 23: ETL All The Things with Ruby

Step 2: Transform

Page 24: ETL All The Things with Ruby

Step 3: Write Output

Page 25: ETL All The Things with Ruby

Before After

Page 27: ETL All The Things with Ruby

Rodimus Concepts

➢ Steps

➢ Transformations

➢ Parallelism

Page 28: ETL All The Things with Ruby

Rodimus Concepts: Steps

➢ Smallest unit of work

➢ Single task

➢ Process one row of data at a time

Page 29: ETL All The Things with Ruby

Rodimus Concepts: Transformations

➢ Contains an array of steps

➢ Steps connected by IO objects

➢ First and last steps are source/destination
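
The relationship between steps and a transformation can be sketched as follows. The `Step` and `Transformation` classes here are hypothetical stand-ins written for illustration, not the Rodimus classes, and this version runs the steps one after another rather than in parallel.

```ruby
require "stringio"

# Hypothetical step: reads rows from `incoming`, writes results to `outgoing`.
class Step
  attr_accessor :incoming, :outgoing

  def run
    incoming.each_line do |row|
      result = process_row(row.chomp)
      outgoing.puts(result) unless result.nil?
    end
  end

  # A single task applied to one row at a time; subclasses override this.
  def process_row(row)
    row
  end
end

class DropBlank < Step
  def process_row(row)
    row.strip.empty? ? nil : row
  end
end

class Upcase < Step
  def process_row(row)
    row.upcase
  end
end

# Hypothetical transformation: an array of steps chained through IO objects.
class Transformation
  attr_reader :steps

  def initialize(source, destination)
    @source = source
    @destination = destination
    @steps = []
  end

  def run
    input = @source
    steps.each_with_index do |step, index|
      last = index == steps.length - 1
      output = last ? @destination : StringIO.new
      step.incoming = input
      step.outgoing = output
      step.run
      output.rewind unless last # let the next step read from the start
      input = output
    end
  end
end

destination = StringIO.new
transformation = Transformation.new(StringIO.new("a\n\nb\n"), destination)
transformation.steps << DropBlank.new << Upcase.new
transformation.run
puts destination.string
```

Because each step only knows about its own `incoming` and `outgoing`, steps can be reordered or reused without changes.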

Page 30: ETL All The Things with Ruby

Rodimus Concepts: Parallelism

➢ Each step runs in its own process

➢ Steps run in parallel

➢ No shared state*

* Not really true
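
Running each step in its own process, connected by pipes, can be sketched with `fork` and `IO.pipe`. This is an assumption about the general technique, not Rodimus's actual implementation; the lambdas and variable names are hypothetical, and `fork` is unavailable on some platforms (for example Windows).

```ruby
# Hypothetical steps, each a single row-level task.
steps = [
  ->(row) { row.upcase },
  ->(row) { "#{row}!" }
]

source_reader, source_writer = IO.pipe
input = source_reader
pids = []

steps.each do |step|
  reader, writer = IO.pipe
  pids << fork do
    # Child process: close the ends it does not use so EOF propagates.
    reader.close
    source_writer.close
    input.each_line { |row| writer.puts(step.call(row.chomp)) }
    writer.close
  end
  # Parent process: hand both ends over to the child and move down the chain.
  input.close
  writer.close
  input = reader
end

# The parent acts as source and destination for this small example.
source_writer.puts("a", "b")
source_writer.close
result = input.read
input.close
pids.each { |pid| Process.wait(pid) }

puts result
```

Each forked child works on its own copy of the parent's memory, so an assignment made inside a child is not visible to the parent or to the other steps; data moves between steps only through the pipes.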

Page 31: ETL All The Things with Ruby

What’s Next for Rodimus?

➢ Benchmarking

○ Identify the bottlenecks

➢ Other platforms

○ Native thread use for JRuby/Rubinius

➢ Synchronous transformations?

➢ Better shared data implementation?

➢ Clustering?

Page 32: ETL All The Things with Ruby

Leveling Up

➢ Working with Unix Processes

➢ Working with Ruby Threads

http://www.jstorimer.com/

Page 34: ETL All The Things with Ruby

Who Am I?

➢ Brandon Rice

➢ http://www.blrice.net/

[email protected]

➢ @brandonlrice

➢ Optoro

➢ http://optoro.com

➢ http://blinq.com

➢ Always hiring!