Kettle – ETL Tool
Sreenivas K
Agenda
Introduction
  ETL Process
  Pentaho's Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
  Spoon
  Transformations
  Jobs
Introduction – ETL Process
Major Components
Extracting
  Gathering raw data from source systems and storing it in the ETL staging environment
  Data profiling
  Identifying data that changed since the last load
Transforming: Cleaning and Conforming
  Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
  Data cleansing
  Recording error events
  Audit dimensions
  Creating and maintaining conformed dimensions and facts
Introduction – ETL Process
Loading
  Loading data into data warehouse tables
  Managing hierarchies in dimensions
  Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
  Fact table loading
  Building and maintaining bridge dimension tables
  Handling late-arriving data
  Management of conformed dimensions
  Administration of fact tables
  Building aggregations
  Building OLAP cubes
  Transferring DW data to other environments for specific purposes
Data Transformation and Integration Examples
Data filtering: is not null, greater than, less than, includes
Field manipulation: trimming, padding, upper- and lowercase conversion
Data calculations: + - × /, average, absolute value, arctangent, natural logarithm
Date manipulation: first day of month, last day of month, add months, week of year, day of year
Data type conversion: string to number, number to string, date to number
Merging fields and splitting fields
Looking up data: look up in a database, in a text file, an Excel sheet, …
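To make a few of the operations above concrete, here is a minimal sketch in plain Python. It is not Kettle code: the function names and sample values are made up for illustration, and Kettle itself exposes these operations as configurable steps rather than as functions.

```python
import calendar
import math
from datetime import date


def is_kept(amount):
    # Data filtering: "is not null" combined with "greater than".
    return amount is not None and amount > 0


def clean_name(name, width=10):
    # Field manipulation: trim, convert to upper case, pad on the right.
    return name.strip().upper().ljust(width)


def first_day_of_month(d):
    return d.replace(day=1)


def last_day_of_month(d):
    # monthrange returns (weekday of first day, number of days in month).
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def add_months(d, months):
    # Date manipulation: move by whole months, clamping the day to the
    # length of the target month (an assumption of this sketch).
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Data type conversion: string to number, number to string, date to number.
amount = float("12.50")
amount_text = str(amount)
date_number = date(2008, 4, 25).toordinal()

# Data calculations: absolute value, arctangent, natural logarithm.
calculations = (abs(-3), math.atan(1.0), math.log(math.e))

# Merging and splitting fields.
full_name = " ".join(["Ada", "Lovelace"])
first, last = full_name.split(" ", 1)
```

Week of year and day of year are available on the same `date` type through `d.isocalendar()[1]` and `d.timetuple().tm_yday`.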
Introduction – Pentaho Kettle
Kettle: Kettle Extraction, Transformation, Transportation and Loading tool
It is part of Pentaho's open source business intelligence suite and provides powerful data integration. Pentaho was founded in 2004.
Products of Pentaho
  Mondrian: OLAP server written in Java
  Kettle: ETL tool
Data Integration - Challenges
Data is everywhere
Data is inconsistent
  Records are different in each system
Performance issues
  Running queries to summarize data over a long period ties up the operational system
Data is never all in the data warehouse
  Excel sheets, acquisitions, new applications
Prerequisites
  Java Runtime Environment 1.5 or above
  Compatible with almost any platform
  Compatible with a wide range of database technologies
Recent Releases
  4/25   Data Integration 3.0.3 GA
  4/18   Data Integration 3.1 Milestone
  2/8    Data Integration 3.0.2 GA
  12/12  Data Integration 3.0.1 GA
  11/15  Data Integration 3.0 GA
  10/31  Data Integration 3.0 RC2
  10/24  Data Integration 2.5.2 GA
  10/08  Data Integration 3.0 RC1
  08/24  Data Integration 2.5.1 GA
Pentaho Components
Spoon
  GUI that allows you to design transformations and jobs that can be run with the Kettle tools Pan and Kitchen
  Transformations and jobs can be described in an XML file or stored in a Kettle database repository
  Spoon is available as a shell script and as a batch file, so the tool can be used in heterogeneous environments
Pan
  A program that executes transformations designed in Spoon, stored as XML or in a database repository
  Transformations are usually scheduled in batch mode to run automatically at regular intervals
Kitchen
  A program that executes jobs designed in Spoon, stored as XML or in a database repository
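As a rough sketch of how Pan and Kitchen might be driven from a scheduler script, the Python below only builds the command lines. The install directory and file paths are hypothetical, and the sketch assumes the `pan.sh` and `kitchen.sh` launchers with their `-file` and `-level` options; check the options against the documentation for your release.

```python
import subprocess

# Hypothetical install directory; adjust to where Kettle is unpacked.
KETTLE_HOME = "/opt/kettle"


def pan_command(transformation_file, level="Basic"):
    # Pan runs a single transformation stored as an XML file.
    return [f"{KETTLE_HOME}/pan.sh", f"-file={transformation_file}", f"-level={level}"]


def kitchen_command(job_file, level="Basic"):
    # Kitchen runs a job stored as an XML file.
    return [f"{KETTLE_HOME}/kitchen.sh", f"-file={job_file}", f"-level={level}"]


def run(command):
    # Returns the process exit code; a non-zero code signals a failure.
    return subprocess.run(command, check=False).returncode


# Hypothetical file names, for illustration only.
nightly_load = kitchen_command("/etl/jobs/nightly_load.kjb")
clean_customers = pan_command("/etl/transformations/clean_customers.ktr", level="Detailed")
```

A cron entry or another scheduler can then call `run(nightly_load)` at the desired interval.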
Repository
Connection establishment
  Auto login: set the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables manually
  Login: by default, PDI provides admin as both the login username and the password
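One way to set the three auto-login variables for a single Spoon session, rather than system-wide, is to pass them in the environment of the launched process. The sketch below assumes a repository named `dev_repo` (a made-up name) and the default admin credentials mentioned above.

```python
import os


def auto_login_env(repository, user, password, base_env=None):
    # Copy the current environment and add the three variables Spoon reads.
    env = dict(os.environ if base_env is None else base_env)
    env["KETTLE_REPOSITORY"] = repository
    env["KETTLE_USER"] = user
    env["KETTLE_PASSWORD"] = password
    return env


# "dev_repo" is a hypothetical repository name.
spoon_env = auto_login_env("dev_repo", "admin", "admin", base_env={})

# The environment could then be handed to the launcher, for example:
# subprocess.Popen(["/opt/kettle/spoon.sh"], env=auto_login_env("dev_repo", "admin", "admin"))
```

Keeping the password in an environment variable is convenient for a demo; for shared machines a less exposed mechanism is usually preferable.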
Transformation
An engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
Value: values are part of a row and can contain any type of data.
Row: a row consists of 0 or more values.
Output stream: an output stream is a stack of rows that leaves a step.
Input stream: an input stream is a stack of rows that enters a step.
Hop: a hop is a graphical representation of one or more data streams between 2 steps.
Note: a note is a piece of information that can be added to a transformation.
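The vocabulary above can be illustrated with a small Python sketch in which each step is a generator, each row is a dictionary of values, and a hop is simply one step consuming another step's output stream. This models the concepts only; it is not how Kettle is implemented, and all names and data are invented.

```python
def input_step():
    # Produces the output stream of the first step: one row at a time.
    yield {"name": "  alice ", "amount": "12.50"}
    yield {"name": "bob", "amount": None}
    yield {"name": " carol", "amount": "7"}


def filter_step(rows):
    # Input stream in, output stream out: drop rows with a null amount.
    for row in rows:
        if row["amount"] is not None:
            yield row


def trim_step(rows):
    for row in rows:
        yield {**row, "name": row["name"].strip()}


def to_number_step(rows):
    for row in rows:
        yield {**row, "amount": float(row["amount"])}


def output_step(rows):
    # Stands in for writing to a table or file: collect the rows.
    return list(rows)


# Each nested call plays the role of a hop between two steps.
result = output_step(to_number_step(trim_step(filter_step(input_step()))))
```

Because generators are lazy, rows flow through all steps one at a time instead of each step finishing before the next one starts.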
Jobs
A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
Job Entry: a job entry is one part of a job and performs a certain task.
Hop: a hop is a graphical representation of one or more data streams between 2 steps.
Note: a note is a piece of information that can be added to a job.
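The idea of a job controlling the order of execution can be sketched as a loop over named entries. The sketch assumes the simplest possible policy, where the job stops at the first entry that fails; the entry names are hypothetical and the lambdas stand in for real transformations.

```python
def run_job(entries):
    """Run job entries in order and stop at the first one that fails.

    `entries` is a list of (name, callable) pairs; each callable returns
    True on success and False on failure.
    """
    executed = []
    for name, entry in entries:
        executed.append(name)
        if not entry():
            return False, executed
    return True, executed


# Hypothetical entries standing in for transformations called by the job.
ok, ran = run_job([
    ("load_staging", lambda: True),
    ("build_dimensions", lambda: True),
    ("load_facts", lambda: True),
])

failed_ok, failed_ran = run_job([
    ("load_staging", lambda: True),
    ("build_dimensions", lambda: False),
    ("load_facts", lambda: True),
])
```

In the second run `load_facts` is never called, which is the kind of sequencing control a job adds on top of individual transformations.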
Input Steps
Output Steps
Lookup Steps
Transformation Steps
Join Steps
DW Steps
Mapping Steps
Job Steps