Slides from my talk about working on an ETL project through three iterations over the course of a year. Check out https://github.com/nevern02/rodimus for the Rodimus project. For some further background on Rodimus, check out http://www.blrice.net/blog/2014/06/03/etl-with-ruby-and-rodimus/
ETL All The Things
With Ruby
BIG DATA
What is ETL?
➢ Extract, Transform, Load
➢ Move data from Point A to Point B
➢ Maybe do something in between
➢ Commonly used in data warehousing
Real World Example: Model Events
➢ Record history of changes to ActiveRecord (AR) models
➢ Record only the fields that changed
➢ Storage in Mongo (no schema)
➢ Analysts use only Windows SQL clients
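Recording only the fields that changed can be sketched in plain Ruby as a diff between two attribute hashes. The `changed_fields` helper, the `Product` name, and the sample attributes are hypothetical; the slides do not show how the project actually captured changes.

```ruby
# Hypothetical helper: return only the attributes whose values differ,
# as { field => [old_value, new_value] }.
def changed_fields(before, after)
  (before.keys | after.keys).each_with_object({}) do |key, diff|
    old_value = before[key]
    new_value = after[key]
    diff[key] = [old_value, new_value] unless old_value == new_value
  end
end

before = { "name" => "Widget", "price" => 10, "status" => "new" }
after  = { "name" => "Widget", "price" => 12, "status" => "new" }

# A schemaless event document: only the changed fields are stored.
event = { "model" => "Product", "changes" => changed_fields(before, after) }
```

The resulting hash has the same shape as what ActiveRecord's `changes` method reports for a dirty model, which is one plausible source for such events.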
Model Events: Considerations
➢ Change data capture (CDC)
➢ Dynamic schema determination
○ Picking a column type
○ Altering existing column types
○ Trying not to take f***ing forever
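One way to approach dynamic schema determination is to rank candidate column types from narrowest to widest, then alter a column only when incoming values need a wider type. The sketch below is an assumption about how that could look, not the project's code; `TYPE_ORDER`, `type_for`, `column_type`, and `needs_alter?` are all hypothetical names.

```ruby
# Hypothetical ordering of column types from narrowest to widest.
TYPE_ORDER = %w[integer float text].freeze

# Guess a column type for a single value.
def type_for(value)
  case value
  when Integer then "integer"
  when Float   then "float"
  else              "text"
  end
end

# Pick the widest type required by any of the sampled values.
# Falls back to "text" when there are no non-nil values to inspect.
def column_type(values)
  values.compact.map { |v| type_for(v) }.max_by { |t| TYPE_ORDER.index(t) } || "text"
end

# An existing column only needs altering when the new type is wider.
def needs_alter?(current_type, new_type)
  TYPE_ORDER.index(new_type) > TYPE_ORDER.index(current_type)
end
```

Skipping the `ALTER` when the existing type is already wide enough is one way to avoid unnecessary, slow schema changes on large tables.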
Model Events: Schema
[Figure: schema before and after]
First Iteration: Ruby
ETL Tools
➢ Ad hoc scripts
➢ Code generators
➢ Engines
➢ Model Driven Architecture (MDA)
Source: Pentaho Kettle Solutions (2010)
ETL Example: Cleaning Data
Before
City          State
Baltimore     Maryland
Harrisburg    Pa
Philadelphia  Penn
After (ETL)
City          State
Baltimore     MD
Harrisburg    PA
Philadelphia  PA
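The cleaning step in this example can be sketched as a lookup that maps each spelling seen in the data to a two-letter abbreviation. The `STATE_ABBREVIATIONS` table and `clean_state` helper are hypothetical and cover only the values shown on the slide.

```ruby
# Hypothetical lookup covering only the variants in the sample data.
STATE_ABBREVIATIONS = {
  "maryland"     => "MD",
  "md"           => "MD",
  "pennsylvania" => "PA",
  "penn"         => "PA",
  "pa"           => "PA"
}.freeze

# Normalize one raw state value; unknown values pass through unchanged.
def clean_state(raw)
  key = raw.to_s.strip.downcase
  STATE_ABBREVIATIONS.fetch(key, raw)
end

rows = [
  { city: "Baltimore",    state: "Maryland" },
  { city: "Harrisburg",   state: "Pa" },
  { city: "Philadelphia", state: "Penn" }
]

cleaned = rows.map { |row| row.merge(state: clean_state(row[:state])) }
```

Passing unknown values through unchanged keeps bad input visible downstream instead of silently dropping it.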
Second Iteration: Kettle
➢ Java-based ETL engine
➢ Open source
➢ GUI development environment
➢ Optimized for non-programmers
Kettle/ETL Resources
➢ Kettle
○ http://community.pentaho.com
➢ Ruby plugin for Kettle
○ https://github.com/nevern02/Ruby-Scripting-for-Kettle
➢ Recommended book
○ Pentaho Kettle Solutions: Building Open Source ETL Solutions With Pentaho Data Integration (2010)
■ Matt Casters, Roland Bouman, Jos van Dongen
Rodimus
➢ https://github.com/nevern02/rodimus
➢ Ruby ETL framework
➢ Lightweight, minimal dependencies
➢ Extensible
➢ Custom solutions built on top
http://blitz-wing.deviantart.com/art/Rodimus-Prime-183670417
Rodimus
[Diagram: a Transformation containing steps 1, 2, 3, with Source and Destination at either end]
Rodimus Example: Rails Logs
1. Read input
2. Transform data
3. Write output
Step 1: Read Input
Step 2: Transform
Step 3: Write Output
Before After
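The code from these slides did not survive extraction. As a rough stand-in, the three steps can be sketched in plain Ruby without Rodimus. This is not the Rodimus API: the `REQUEST_LINE` pattern, the method names, and the choice of JSON output are all assumptions made for illustration.

```ruby
require "json"
require "stringio"

# Matches the line Rails writes at the start of each request.
REQUEST_LINE = /\AStarted (?<verb>[A-Z]+) "(?<path>[^"]+)" for (?<ip>\S+) at (?<time>.+)\z/

# Step 1: read input, one line at a time.
def read_lines(io)
  io.each_line.map(&:chomp)
end

# Step 2: turn request lines into hashes and skip everything else.
def transform(lines)
  lines.map { |line| REQUEST_LINE.match(line) }
       .compact
       .map { |m| { "verb" => m[:verb], "path" => m[:path], "ip" => m[:ip], "time" => m[:time] } }
end

# Step 3: write output as one JSON document per line.
def write_rows(rows, io)
  rows.each { |row| io.puts(JSON.generate(row)) }
end

log = StringIO.new(<<~LOG)
  Started GET "/users" for 127.0.0.1 at 2014-06-03 10:00:00 -0400
  Processing by UsersController#index as HTML
  Completed 200 OK in 12ms
LOG

output = StringIO.new
write_rows(transform(read_lines(log)), output)
```

`StringIO` stands in for the log file and the destination so the sketch runs without touching disk.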
Rodimus Concepts
➢ Steps
➢ Transformations
➢ Parallelism
Rodimus Concepts: Steps
➢ Smallest unit of work
➢ Single task
➢ Process one row of data at a time
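A step in this sense can be sketched as an object that reads rows from one IO, applies a single task to each row, and writes the result to another IO. `UpcaseStep` and its method names are hypothetical and are not taken from Rodimus itself.

```ruby
require "stringio"

# Hypothetical step: reads rows from `incoming`, handles one row at a time,
# and writes each result to `outgoing`.
class UpcaseStep
  def initialize(incoming, outgoing)
    @incoming = incoming
    @outgoing = outgoing
  end

  # The single task this step performs on one row.
  def process_row(row)
    row.upcase
  end

  # Stream rows through the step without loading the whole input.
  def run
    @incoming.each_line do |line|
      @outgoing.puts(process_row(line.chomp))
    end
  end
end

source = StringIO.new("alpha\nbeta\n")
destination = StringIO.new
UpcaseStep.new(source, destination).run
```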
Rodimus Concepts: Transformations
➢ Contains an array of steps
➢ Steps connected by IO objects
➢ First and last steps are source/destination
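The idea of an array of steps joined by IO objects can be sketched with `IO.pipe`: each step writes to a pipe that the next step reads from. The `BlockStep` and `Transformation` classes below are hypothetical stand-ins, not Rodimus classes, and the steps run one after another here, which only works while the data fits in the pipe buffer.

```ruby
require "stringio"

# Hypothetical step that applies a block to every row it reads.
class BlockStep
  attr_accessor :incoming, :outgoing

  def initialize(&block)
    @block = block
  end

  def run
    incoming.each_line { |line| outgoing.puts(@block.call(line.chomp)) }
  end
end

# Hypothetical transformation: wires the steps together and runs them in order.
class Transformation
  def initialize(source, destination, steps)
    @source = source
    @destination = destination
    @steps = steps
  end

  def run
    incoming = @source
    @steps.each_with_index do |step, index|
      last = index == @steps.length - 1
      # The last step writes to the destination; every other step gets a pipe.
      reader, writer = last ? [nil, @destination] : IO.pipe
      step.incoming = incoming
      step.outgoing = writer
      step.run
      writer.close unless last # signals end-of-input to the next step
      incoming = reader
    end
  end
end

source = StringIO.new(" a \n b \n")
destination = StringIO.new
steps = [
  BlockStep.new { |row| row.strip },
  BlockStep.new { |row| row.upcase },
  BlockStep.new { |row| "#{row}!" }
]
Transformation.new(source, destination, steps).run
```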
Rodimus Concepts: Parallelism
➢ Each step runs in its own process
➢ Steps run in parallel
➢ No shared state*
* Not really true
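Running each step in its own process can be sketched with `fork` and `IO.pipe`. The `run_in_parallel` helper is hypothetical and is not how Rodimus is necessarily implemented; it assumes a platform where `fork` is available and input small enough to fit in the pipe buffers. The `seen` counter illustrates the plain "no shared state" case: a forked child changes only its own copy of the variable.

```ruby
# Hypothetical helper: run each step (a callable taking one row) in a forked
# child process, with pipes carrying rows from one step to the next.
def run_in_parallel(rows, steps)
  pipes = Array.new(steps.length + 1) { IO.pipe }

  pids = steps.each_with_index.map do |step, index|
    fork do
      reader = pipes[index][0]
      writer = pipes[index + 1][1]
      # Close every pipe end this child does not use, so end-of-input propagates.
      pipes.flatten.each { |io| io.close unless io.equal?(reader) || io.equal?(writer) }
      reader.each_line { |line| writer.puts(step.call(line.chomp)) }
      writer.close
    end
  end

  # The parent keeps only the first write end and the last read end.
  first_writer = pipes.first[1]
  last_reader = pipes.last[0]
  pipes.flatten.each { |io| io.close unless io.equal?(first_writer) || io.equal?(last_reader) }

  rows.each { |row| first_writer.puts(row) }
  first_writer.close
  result = last_reader.read.lines.map(&:chomp)
  last_reader.close
  pids.each { |pid| Process.wait(pid) }
  result
end

seen = 0
steps = [
  ->(row) { seen += 1; row.upcase },
  ->(row) { row.reverse }
]
result = run_in_parallel(%w[abc def], steps)
# `seen` is unchanged in the parent: the increments happened in a child's copy.
```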
What’s Next for Rodimus?
➢ Benchmarking
○ Identify the bottlenecks
➢ Other platforms
○ Native thread use for JRuby/Rubinius
➢ Synchronous transformations?
➢ Better shared data implementation?
➢ Clustering?
Leveling Up
➢ Working with Unix Processes
➢ Working with Ruby Threads
http://www.jstorimer.com/
Who Am I?
➢ Brandon Rice
➢ http://www.blrice.net/
➢ @brandonlrice
➢ Optoro
➢ http://optoro.com
➢ http://blinq.com
➢ Always hiring!