IBM Software Group
© 2007 IBM Corporation
Designing your BI Architecture
Data Movement and Transformation
David Cope
EDW Architect – Asia Pacific
DataStage and DWE SQW
[Diagram: sources (ERP, IMS, complex files, XML, other DB2) feed an ETL engine and SQL scripts that load the EDW]
IBM Information Server: Delivering information you can trust
[Diagram: Information Server capabilities — Understand (Information Analyzer), Cleanse (QualityStage), Transform & Move (DataStage), and Federate (Federation Server) — exposed through the Information Services Director, backed by a shared Metadata Server, parallel processing, and rich connectivity to applications, data, and content]
IBM Information Server Architecture
[Diagram: layered architecture — a unified user interface (analysis, web admin, and development interfaces); common services (metadata, security, logging and reporting); unified metadata (design and operational); a unified parallel processing engine (understand, cleanse, transform, deliver); unified service deployment; and common connectivity to structured and unstructured data, applications, and mainframe sources]
Introducing DataStage
• Integrates data from the widest range of enterprise and external data sources
• Incorporates data validation rules
• Processes and transforms large amounts of data using scalable parallel processing
• Handles very complex transformations
• Manages multiple integration processes
• Provides direct connectivity to enterprise applications as sources or targets
• Leverages metadata for analysis and maintenance
• Operates in batch, in real time, or as a Web service
[Diagram: WebSphere DataStage clients (Designer, Director, Administrator, Manager) connect to the WebSphere DataStage Server]
IBM DataStage Enterprise Edition Components
• Designer: a design interface used to create WebSphere DataStage applications (known as jobs). User: ETL Developer
• Manager: used to view and edit the contents of the WebSphere DataStage Repository. User: ETL Developer
• Administrator: used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria. User: ETL Administrator
• Director: used to validate, schedule, run, and monitor DataStage jobs. User: ETL Developer / ETL Operator
A client/server development environment.
What is Enterprise Edition?
• WebSphere DataStage Enterprise Edition ("EE") takes performance to a new level, allowing you to handle the massive volume, velocity, and variety of data flowing into your organization
• Enterprise Edition provides native parallel processing capabilities, including:
  - Near-linear scalability across parallel hardware environments
  - Isolation of job design from actual runtime resources (hardware and software)
  - Data pipelining
  - Data partitioning (including automatic and dynamic re-partitioning)
  - Parallel I/O
  - High-performance parallel Sort, Aggregator, Lookup, Join, and Merge
  - Native (compiled) parallel Transformer
  - Parallel database interfaces
  - More than 50 native parallel stages
DataStage Enterprise Edition Architecture
[Diagram: DataStage clients (Manager, Designer, Director, on Windows NT or Windows 2000) connect via the DataStage Connect API to the DataStage Server with Enterprise Edition (Windows 2003, Linux, UNIX, or USS; uniprocessors, SMPs, clusters, or MPPs). Data flows from data sources to targets (databases or files) over ODBC or native connections.]
Traditional Batch (ETL) Processing
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  - a 10 GB stream leads to 70 GB of I/O
  - processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  - disk I/O consumes the processing
  - terabytes of disk are required for temporary staging
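One plausible accounting behind the "10 GB stream leads to 70 GB of I/O" figure is sketched below. It assumes (our assumption, not the slide's) a four-step job with three intermediate staging files, counting the initial source read plus one write and one re-read per staging file:

```python
# Back-of-the-envelope I/O cost of staged (non-pipelined) batch ETL.
# Assumption: num_steps processing steps with num_steps - 1 staging
# files between them; each staging file is written once and read once,
# plus the initial read of the source stream.
def staged_io_gb(stream_gb: float, num_steps: int) -> float:
    staging_files = num_steps - 1
    return stream_gb * (1 + 2 * staging_files)

print(staged_io_gb(10, 4))  # 70.0 — matching the slide's figure
```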
Data Flow Architecture: Data Pipelining
Think of a conveyor belt moving the records from step to step…
• Run each step simultaneously, passing data records along
  - e.g., Transform, Enrich, and Load run simultaneously
• Eliminates intermediate staging to disk
• Keeps the processors busy
• But pipelining alone still limits overall scalability
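The conveyor-belt idea can be sketched with Python generators: each stage pulls records from the previous one, so all stages make progress concurrently and nothing is staged to disk. The stage names are illustrative, not DataStage APIs:

```python
# A minimal data-pipelining sketch: records flow one at a time from
# extract through transform and enrich, with no intermediate files.
def extract():
    for i in range(5):
        yield {"id": i}

def transform(records):
    for r in records:
        yield {**r, "doubled": r["id"] * 2}

def enrich(records):
    for r in records:
        yield {**r, "source": "demo"}

loaded = list(enrich(transform(extract())))
print(len(loaded))  # 5 records flowed through without staging
```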
Combined Partition and Pipeline Parallelism
• Record repartitioning occurs automatically
• No need to repartition data when you:
  - add processors
  - change hardware architecture
• A broad range of partitioning methods is available
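Hash partitioning, one common partitioning method, can be sketched as follows: records with the same key always land on the same node, so per-key operations such as joins and aggregations stay local. The keys and node counts here are illustrative:

```python
# A minimal hash-partitioning sketch. Changing num_nodes is all that
# repartitioning for more processors requires; the job design itself
# does not change.
import hashlib

def partition(key: str, num_nodes: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = ["cust-1", "cust-2", "cust-3", "cust-1"]
print([partition(k, 4) for k in keys])
```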
Execution, Production Environment
• Supports all hardware configurations with a single job design
• Scale by simply adding processors or nodes, with no application change or re-compilation
• An external configuration file specifies the hardware configuration and resources
UNLIMITED SCALABILITY
Job Design vs. Execution
The developer assembles the flow using the DataStage Designer. At runtime, the same job runs in parallel on any configuration (1 node, 4 nodes, N nodes), with no need to modify or recompile the job design.
Job Monitoring and Scheduling
Job Performance Analysis
A visualization tool that:
• Provides deeper insight into runtime job behavior
• Offers several categories of visualizations, including:
  - Record throughput
  - CPU utilization
  - Job timing
  - Job memory utilization
  - Physical machine utilization
DataStage and DWE SQW
[Diagram repeated: sources (ERP, IMS, complex files, XML, other DB2) feeding the EDW]
SQL Warehousing Tool (SQW)
• Build and execute intra-warehouse (SQL-based) data movement and transformation services
• Integrated development environment and metadata system
  - Model logical flows of higher-level operations
  - Generate code and create execution plans
  - Test and debug flows
  - Package generated code and artifacts into a data warehouse application
  - Integrate SQW flows and DataStage jobs
• Runtime infrastructure
  - Configuration of runtime environments
  - Deployment of warehouse applications
  - Manage, execute, and monitor processes and activities
• SQW flows execute in a DB2 execution database
• DataStage jobs execute on a DataStage server
[Diagram: SQW application life cycle — Design (data/mining flow creation and control flow creation in the GUI, with a non-WAS Design Center debugger and executor); Deployment Preparation (define a warehouse application; parameterize the application and generate plans; create a deployment package; production ready); Production Deployment via the Admin Console (deploy the application to WAS; prepare the DB environment); Administration/Execution (the WAS DIS executor runs execution plans (EPG) on the DB2 SQL execution engine and the DataStage execution engine against sources and targets, with process management, process scheduling, statistics, and logging)]
DWE Components
[Diagram: the Design Studio (control flow editor and data/mining flow editor, with operators such as Extract, SQL Join, SQL Lookup, DS subflow, FTP, SQL DF, DS Job, Email, and Verify) sits on an Eclipse Modeling Framework metadata layer. At runtime, the DWE Admin Console (web browser) drives DIS on WebSphere Application Server, which coordinates the DataStage Server and DB2, with flat-file/JDBC metadata.]
Life Cycle of a SQW Data Warehouse Application
1. Install and set up design and runtime environments
2. Design and validate data flows
3. Test-run data flows
4. Design and validate control flows
5. Test-run control flows
6. Prepare control flow application for deployment
7. Deploy application (from console)
8. Run and manage application at process (control flow) level (from console)
9. Iterate based on changes in source and target databases
Note: For testing purposes, you can design and run applications from the Design Studio (a built-in runtime environment without WebSphere; you just need a DB2 instance).
Data Flows: Definition and Simple Example
• Data flows are models that represent data movement and transformation requirements
• The SQW code generator translates the models into repeatable, SQL-based warehouse-building processes
• Data from source files and tables moves through a series of transformation steps, then loads or updates a target file or table
• The following example selects data from a DB2 staging table, removes duplicates, sorts the rows, and inserts the result into another DB2 table. Discarded duplicates go to a flat file.
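The SQL such a flow might generate can be sketched as below, run here against an in-memory SQLite database instead of DB2. The table and column names are invented for illustration; SQW's actual generated SQL will differ:

```python
# Sketch of the example flow: dedupe a staging table, sort, and insert
# into a target table, counting the discarded duplicate rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (cust_id INTEGER, name TEXT);
    CREATE TABLE target  (cust_id INTEGER, name TEXT);
    INSERT INTO staging VALUES (2, 'B'), (1, 'A'), (2, 'B'), (3, 'C');
""")

# Remove duplicates, sort the rows, and insert into the target table.
con.execute("""
    INSERT INTO target
    SELECT DISTINCT cust_id, name FROM staging ORDER BY cust_id
""")

# The discarded duplicates would go to a flat file; here we just count.
dupes = con.execute("""
    SELECT COUNT(*) - COUNT(DISTINCT cust_id || '|' || name) FROM staging
""").fetchone()[0]
print(dupes)  # 1 duplicate row discarded
```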
Data Flows: Anatomy
• Operators: sources, targets, and transformations
• Ports: define the points of data input or output for an operator, and also define the data layout
• Connectors: direct the flow of data from an output port of one operator to the input port of another operator
[Diagram: several source operators feed transform operators through output (O) and input (I) ports, ending in a target operator]
Data Flows: Source and Target Operators
• Sources
  - File import
  - Table source
  - SQL replication source
• Targets
  - File export
  - Table target (SQL insert, update)
  - Bulk load target (DB2 load utility)
  - SQL merge (upsert)
  - Slowly changing dimension (SCD)
• Data station
  - special staging operator
  - intermediate target
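The "SQL merge (upsert)" target can be sketched as follows: insert rows with new keys, update rows whose keys already exist. SQLite's INSERT ... ON CONFLICT stands in for DB2's MERGE statement here, and the names are illustrative:

```python
# Minimal upsert sketch: key 1 already exists and is updated; key 2 is
# new and is inserted.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim (cust_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO dim VALUES (1, 'old name')")

for cust_id, name in [(1, "new name"), (2, "B")]:
    con.execute(
        """INSERT INTO dim VALUES (?, ?)
           ON CONFLICT(cust_id) DO UPDATE SET name = excluded.name""",
        (cust_id, name),
    )

print(con.execute("SELECT * FROM dim ORDER BY cust_id").fetchall())
# [(1, 'new name'), (2, 'B')]
```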
Data Flows: Transform Operators
• Select list (columns and expressions)
• Distinct (similar to a SELECT DISTINCT)
• Where condition (constraints)
• Table join (inner and outer joins supported)
• Group by (aggregations, HAVING clause)
• Order by
• Union (also INTERSECT and EXCEPT)
• Pivot and unpivot
• Key lookup
• Fact key replace
• Sequence (DB2 key generator)
• Splitter
• Custom SQL
• DB2 table function
Data Flows: Operator Properties
• Properties view for all operators: properties for operators and properties for operator ports
• Properties are duplicated in a wizard view for object-dependent operators (table/file sources and targets, data station, etc.)
• The wizard view prompts for the object definition but does not require it
• The properties view approach is the standard Eclipse interface for defining object attributes
[Screenshots: properties wizard and properties view]
Data Flows: Ports and Port Properties
• Operators have input and/or output ports
• Connections go from upstream output ports to downstream input ports
• Ports have properties (virtual table definitions)
Data Flows: Column Level Connections
• Connections may need to be made at the column level:
  - You might change your mind about a flow definition, delete a connection, or delete an upstream operator
  - You might not use all of the attributes that you defined downstream
  - You can use column-level connections to refresh or modify the new input schema
Data Flows: Variables
• Variables can be used in data flows to:
  - Defer the definition of certain properties until a later phase in the life cycle:
    - file names
    - table names
    - database schema names
    - etc.
  - Generalize a data flow
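The deferral idea can be sketched with simple placeholder substitution: the flow references a variable, and the concrete value is bound at a later phase (deployment or runtime). The variable names and `${...}` syntax are illustrative; SQW's actual variable syntax may differ:

```python
# Minimal deferred-variable sketch using string.Template.
from string import Template

flow_sql = Template("SELECT * FROM ${SCHEMA}.${SOURCE_TABLE}")

# Design time: the variables are defined but unresolved.
# Deployment/runtime: values are bound for the target environment.
prod_values = {"SCHEMA": "WAREHOUSE", "SOURCE_TABLE": "SALES_STG"}
print(flow_sql.substitute(prod_values))
# SELECT * FROM WAREHOUSE.SALES_STG
```

The same variable can then be reused across multiple operators and flows, with each environment supplying its own values.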
Data Flows: Variable Definition and Selection
• Define a variable using the Variable Manager
• Set its properties, current value, and phase (the phase determines when the value can be set during the life cycle)
• Use the same variable in multiple operators in different flows
Data Flows: Validation
• When you save or validate a data flow, any errors are identified
• Yellow exclamation marks are warnings; red X marks are serious errors
• Hover help text exists for these error conditions; just mouse over the icon
• Also check the Problems view (next to Properties) to see the errors
• Validation rules cover a variety of error conditions: missing links and properties, for example
Data Flows: Data Station Operators
• Staging points in a data flow
• Station types: persistent table, temporary table, view, or file (temporary tables and views are dropped after execution)
• Data stations with persistent tables can serve as target operators
• Useful as a recovery mechanism and as a checkpoint (what does the data set look like at this point in the flow?)
• Pass-through option: switch the data station on and off for different runs
Data Flows: Subflows
• A subflow is a predefined set of operators that you can place inside a data flow
• Useful as a plug-in for multiple versions of the same or similar data flows
• Containers or building blocks for complex flows (division of labor)
• Blue ports represent subflow inputs and outputs
Data Flows: Subflows
• Subflows consist of input ports and/or output ports and operators
• Where does the subflow fit?
  - Input ports only = subflow at the beginning of a data flow
  - Output ports only = subflow at the end of a data flow
  - Input and output ports = subflow is mid-flow
• After creating a subflow, drop it into a data flow
• Subflows can be nested
• Data flows can be saved as subflows
• DataStage jobs can be imported into data flows as subflows
Data Flows: Design Studio Execution
• Validate the flow first and troubleshoot any errors
• Generate and review the code (optional)
• Complete the Flow Execution wizard:
  - Choose or define the run profile
  - Select resources and variable values if required
  - Wait for the execution results to be displayed
• Design Studio execution is intended for testing and training purposes
• Deploy applications to the DWE Runtime for production runs, scheduling, and administration
Data Flows: Testing – Logs and Tracing
• Diagnostics tab of the Flow Execution wizard
• Log file path
• Log files can be appended or overwritten
• Tip: tracing overhead does not depend on data input size, so tracing time is negligible for large data sets
Data Flows: Complete Example
Control Flows: Definition and Simple Example
• A control flow is a container model that sequences one or more data flows and integrates other data processing rules and activities
• Data warehouse applications that you deploy to the DWE Runtime Environment depend on control flows
• You cannot deploy data flows independently; wrap them inside a control flow first
• This simple example processes two data flows in sequence; if either fails, e-mail is sent to an administrator
Control Flows: Anatomy
• Operators: define the type of activity
• Ports: define the entry and exit points of an operator
• Connectors: direct the processing flow of control between operators
Control Flows: Ports
• Entry
• On-success exit
• Unconditional exit
• On-failure exit
An unconditional connection supersedes conditional connections.
Control Flows: Ports – Start/End Operators
• Start Process: the entry point; only one Start operator per control flow
• Process On-Failure: invoked after an activity's on-failure branch, if any
• Cleanup Process: invoked after reaching the terminal point of any branch
• On-failure and cleanup processes are optional, but you may have multiple as needed
Control Flows: Operators
• SQW flow operators
  - Data flow
  - Mining flow
• Command operators
  - DB2 shell (OS scripts)
  - DB2 scripts
  - FTP
  - Executable
• Control operators
  - File wait
  - Iterator
  - End
• Email operator
• DataStage operators
  - Job sequence
  - Parallel job
Control Flows: Iterators
• Data processing loops that iterate over:
  - a series of delimited items in a file
  - a series of files in a directory
  - a fixed number of operations
• For example, a data flow can be executed multiple times inside one control flow, based on the existence of a set of different input files at runtime
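The "iterate over files in a directory" pattern can be sketched as below: run the same (stand-in) data flow once per input file found at runtime. The `run_data_flow` function is a placeholder, not an SQW API:

```python
# Minimal iterator sketch: discover input files at runtime and invoke
# the same flow once per file.
import pathlib
import tempfile

def run_data_flow(input_file: pathlib.Path) -> str:
    # Stand-in for executing a data flow against one input file.
    return f"processed {input_file.name}"

with tempfile.TemporaryDirectory() as d:
    for name in ("jan.csv", "feb.csv", "mar.csv"):
        (pathlib.Path(d) / name).write_text("stub data")

    results = [run_data_flow(f) for f in sorted(pathlib.Path(d).glob("*.csv"))]

print(results)
# ['processed feb.csv', 'processed jan.csv', 'processed mar.csv']
```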
Control Flows: Design Studio Execution
• Validate the flow first and troubleshoot any errors
• Generate and review the code (optional)
• Complete the Flow Execution wizard:
  - Choose or define the run profile
  - Select resources and variable values if required
  - Wait for the execution results to be displayed
• Design Studio execution is intended for testing and training purposes
• Deploy applications to the DWE Runtime for production runs, scheduling, and administration
• Code for control flow operators is validated and generated sequentially
• For subflows and macros, code is generated every time one is referenced in a data flow
Control Flows: Command Line Execution
• Execute a data warehouse application process through a command-line interface
• A Java program that can be invoked outside of WAS
  - For example: startSQWInstance -app <application_name> -process <process_name>
  - Embeddable inside a user application, for example as a means to integrate a third-party or customized scheduler by invoking a data warehouse process directly from the third-party scheduler application
• Examples of the command-line interface:
  - setSQWApplicationStatus: enable/disable an application
  - setSQWProcessStatus: enable/disable a process
  - reStartSQWInstance: restart an application instance
  - startSQWInstance -app app_name -process process_name: start an application process
  - getSQWProcessList -app app_name: get the list of instances of an application
  - getSQWApplicationList -file filename: get the list of applications from an application profile
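A wrapper (for instance, a third-party scheduler) might invoke the command line as sketched below. Only the argument list is built here; the `startSQWInstance` command name comes from the slide, but the application and process names, install path, and exact invocation are assumptions:

```python
# Minimal sketch of building the SQW command line for a scheduler.
def build_sqw_command(app: str, process: str) -> list:
    return ["startSQWInstance", "-app", app, "-process", process]

cmd = build_sqw_command("sales_wh", "nightly_load")
print(" ".join(cmd))
# startSQWInstance -app sales_wh -process nightly_load
# A scheduler would then run it, e.g.: subprocess.run(cmd, check=True)
```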
Control Flows: Complete Example
DataStage and DWE SQW
[Diagram repeated: sources (ERP, IMS, complex files, XML, other DB2) feeding the EDW, with DataStage and DWE SQW highlighted]
Design Studio with DataStage: Integration Points
[Diagram: the Design Studio (control flow and data flow editors, EMF metadata, code generator/optimizer) alongside the runtime (DWE Admin Console on WebSphere Application Server, DataStage Server, DB2). Integration points: import a DataStage job as a visual subflow; import a DataStage job as an opaque runtime object; export SQL to DataStage as a command operator; call DWE flows directly from the DataStage scheduler]
Integrated Tools for Dynamic Warehousing
IBM Information Server
• Seamless integration of DataStage jobs into the SQW environment
Import capabilities - Subflow
• From the DataStage Designer, export a DataStage job in XML format
• Bring the job into the Design Studio as a subflow
Import – Control Flow
• Not really an import, per se
• The ability to execute a DataStage job or sequence as a "black box" within a control flow
Export capabilities
• Deploy a data flow as a set of DataStage executables (SQL, XML, and DSX files)
• Open the data flow in the DataStage Designer as a parallel job