IBM Software Group
© 2007 IBM Corporation
Designing your BI Architecture
Data Movement and Transformation
David Cope
EDW Architect – Asia Pacific
DataStage and DWE SQW
[Diagram: sources (ERP, IMS, complex files, XML, other DB2) feed an ETL engine and SQL scripts that load the EDW]
IBM Information Server: Delivering information you can trust
[Diagram: Information Server capabilities — Understand (Information Analyzer), Cleanse (QualityStage), Transform & Move (DataStage), and Federate (Federation Server) — exposed through the Information Services Director, backed by a shared Metadata Server, parallel processing, and rich connectivity to applications, data, and content]
IBM Information Server Architecture
[Diagram: layered architecture — a unified user interface (analysis, web admin, and development interfaces); common services (metadata, security, logging and reporting); unified metadata (design and operational); a unified parallel processing engine (understand, cleanse, transform, deliver); unified service deployment; and common connectivity to structured and unstructured data, applications, and mainframe sources]
Introducing DataStage
• Integrates data from the widest range of enterprise and external data sources
• Incorporates data validation rules
• Processes and transforms large amounts of data using scalable parallel processing
• Handles very complex transformations
• Manages multiple integration processes
• Provides direct connectivity to enterprise applications as sources or targets
• Leverages metadata for analysis and maintenance
• Operates in batch, in real time, or as a Web service
[Diagram: WebSphere DataStage clients (Designer, Director, Administrator, Manager) connect to the WebSphere DataStage Server]
IBM DataStage Enterprise Edition Components
• Designer: a design interface used to create WebSphere DataStage applications (known as jobs). User: ETL Developer
• Manager: used to view and edit the contents of the WebSphere DataStage Repository. User: ETL Developer
• Administrator: used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria. User: ETL Administrator
• Director: used to validate, schedule, run, and monitor DataStage jobs. User: ETL Developer / ETL Operator
A client/server development environment.
What is Enterprise Edition?
• WebSphere DataStage Enterprise Edition ("EE") takes performance to a new level, allowing you to handle the massive volume, velocity, and variety of data flowing into your organization
• Enterprise Edition provides native parallel processing capabilities, including:
  - Near-linear scalability across parallel hardware environments
  - Isolation of job design from actual runtime resources (hardware and software)
  - Data pipelining
  - Data partitioning (including automatic and dynamic re-partitioning)
  - Parallel I/O
  - High-performance parallel Sort, Aggregator, Lookup, Join, and Merge
  - Native (compiled) parallel Transformer
  - Parallel database interfaces
  - More than 50 native parallel stages
DataStage Enterprise Edition Architecture
[Diagram: DataStage clients (Manager, Designer, Director, on Windows NT or Windows 2000) connect via the DataStage Connect API to the DataStage Server with Enterprise Edition (Windows 2003, Linux, UNIX, or USS; uniprocessors, SMPs, clusters, or MPPs). Data flows from data sources to targets (databases or files) over ODBC or native connections.]
Traditional Batch (ETL) Processing
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  - a 10 GB stream leads to 70 GB of I/O
  - processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  - disk I/O consumes the processing
  - terabytes of disk are required for temporary staging
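One plausible accounting behind the "10 GB stream leads to 70 GB of I/O" figure is sketched below. It assumes (our assumption, not the slide's) a four-step job with three intermediate staging files, counting the initial source read plus one write and one re-read per staging file:

```python
# Back-of-the-envelope I/O cost of staged (non-pipelined) batch ETL.
# Assumption: num_steps processing steps with num_steps - 1 staging
# files between them; each staging file is written once and read once,
# plus the initial read of the source stream.
def staged_io_gb(stream_gb: float, num_steps: int) -> float:
    staging_files = num_steps - 1
    return stream_gb * (1 + 2 * staging_files)

print(staged_io_gb(10, 4))  # 70.0 — matching the slide's figure
```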
Data Flow Architecture: Data Pipelining
Think of a conveyor belt moving the records from step to step…
• Run each step simultaneously, passing data records along
  - e.g., Transform, Enrich, and Load run simultaneously
• Eliminates intermediate staging to disk
• Keeps the processors busy
• But pipelining alone still limits overall scalability
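The conveyor-belt idea can be sketched with Python generators: each stage pulls records from the previous one, so all stages make progress concurrently and nothing is staged to disk. The stage names are illustrative, not DataStage APIs:

```python
# A minimal data-pipelining sketch: records flow one at a time from
# extract through transform and enrich, with no intermediate files.
def extract():
    for i in range(5):
        yield {"id": i}

def transform(records):
    for r in records:
        yield {**r, "doubled": r["id"] * 2}

def enrich(records):
    for r in records:
        yield {**r, "source": "demo"}

loaded = list(enrich(transform(extract())))
print(len(loaded))  # 5 records flowed through without staging
```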
Combined Partition and Pipeline Parallelism
• Record repartitioning occurs automatically
• No need to repartition data when you:
  - add processors
  - change hardware architecture
• A broad range of partitioning methods is available
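Hash partitioning, one common partitioning method, can be sketched as follows: records with the same key always land on the same node, so per-key operations such as joins and aggregations stay local. The keys and node counts here are illustrative:

```python
# A minimal hash-partitioning sketch. Changing num_nodes is all that
# repartitioning for more processors requires; the job design itself
# does not change.
import hashlib

def partition(key: str, num_nodes: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = ["cust-1", "cust-2", "cust-3", "cust-1"]
print([partition(k, 4) for k in keys])
```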
Execution, Production Environment
• Supports all hardware configurations with a single job design
• Scale by simply adding processors or nodes, with no application change or re-compilation
• An external configuration file specifies the hardware configuration and resources
UNLIMITED SCALABILITY
Job Design vs. Execution
The developer assembles the flow using the DataStage Designer. At runtime, the same job runs in parallel on any configuration (1 node, 4 nodes, N nodes), with no need to modify or recompile the job design.
Job Monitoring and Scheduling
Job Performance Analysis
A visualization tool that:
• Provides deeper insight into runtime job behavior
• Offers several categories of visualizations, including:
  - Record throughput
  - CPU utilization
  - Job timing
  - Job memory utilization
  - Physical machine utilization
DataStage and DWE SQW
[Diagram repeated: sources (ERP, IMS, complex files, XML, other DB2) feeding the EDW]
SQL Warehousing Tool (SQW)
• Build and execute intra-warehouse (SQL-based) data movement and transformation services
• Integrated development environment and metadata system
  - Model logical flows of higher-level operations
  - Generate code and create execution plans
  - Test and debug flows
  - Package generated code and artifacts into a data warehouse application
  - Integrate SQW flows and DataStage jobs
• Runtime infrastructure
  - Configuration of runtime environments
  - Deployment of warehouse applications
  - Manage, execute, and monitor processes and activities
• SQW flows execute in a DB2 execution database
• DataStage jobs execute on a DataStage server
[Diagram: SQW application life cycle — Design (data/mining flow creation and control flow creation in the GUI, with a non-WAS Design Center debugger and executor); Deployment Preparation (define a warehouse application; parameterize the application and generate plans; create a deployment package; production ready); Production Deployment via the Admin Console (deploy the application to WAS; prepare the DB environment); Administration/Execution (the WAS DIS executor runs execution plans (EPG) on the DB2 SQL execution engine and the DataStage execution engine against sources and targets, with process management, process scheduling, statistics, and logging)]
DWE Components
[Diagram: the Design Studio (control flow editor and data/mining flow editor, with operators such as Extract, SQL Join, SQL Lookup, DS subflow, FTP, SQL DF, DS Job, Email, and Verify) sits on an Eclipse Modeling Framework metadata layer. At runtime, the DWE Admin Console (web browser) drives DIS on WebSphere Application Server, which coordinates the DataStage Server and DB2, with flat-file/JDBC metadata.]
Life Cycle of a SQW Data Warehouse Application
1. Install and set up design and runtime environments
2. Design and validate data flows
3. Test-run data flows
4. Design and validate control flows
5. Test-run control flows
6. Prepare control flow application for deployment
7. Deploy application (from console)
8. Run and manage application at process (control flow) level (from console)
9. Iterate based on changes in source and target databases
Note: For testing purposes, you can design and run applications from the Design Studio (a built-in runtime environment without WebSphere; you just need a DB2 instance).
Data Flows: Definition and Simple Example
• Data flows are models that represent data movement and transformation requirements
• The SQW code generator translates the models into repeatable, SQL-based warehouse-building processes
• Data from source files and tables moves through a series of transformation steps, then loads or updates a target file or table
• The following example selects data from a DB2 staging table, removes duplicates, sorts the rows, and inserts the result into another DB2 table. Discarded duplicates go to a flat file.
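The SQL such a flow might generate can be sketched as below, run here against an in-memory SQLite database instead of DB2. The table and column names are invented for illustration; SQW's actual generated SQL will differ:

```python
# Sketch of the example flow: dedupe a staging table, sort, and insert
# into a target table, counting the discarded duplicate rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (cust_id INTEGER, name TEXT);
    CREATE TABLE target  (cust_id INTEGER, name TEXT);
    INSERT INTO staging VALUES (2, 'B'), (1, 'A'), (2, 'B'), (3, 'C');
""")

# Remove duplicates, sort the rows, and insert into the target table.
con.execute("""
    INSERT INTO target
    SELECT DISTINCT cust_id, name FROM staging ORDER BY cust_id
""")

# The discarded duplicates would go to a flat file; here we just count.
dupes = con.execute("""
    SELECT COUNT(*) - COUNT(DISTINCT cust_id || '|' || name) FROM staging
""").fetchone()[0]
print(dupes)  # 1 duplicate row discarded
```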
Data Flows: Anatomy
• Operators: sources, targets, and transformations
• Ports: define the points of data input or output for an operator, and also define the data layout
• Connectors: direct the flow of data from an output port of one operator to the input port of another operator
[Diagram: several source operators feed transform operators through output (O) and input (I) ports, ending in a target operator]
Data Flows: Source and Target Operators
• Sources
  - File import
  - Table source
  - SQL replication source
• Targets
  - File export
  - Table target (SQL insert, update)
  - Bulk load target (DB2 load utility)
  - SQL merge (upsert)
  - Slowly changing dimension (SCD)
• Data station
  - special staging operator
  - intermediate target
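The "SQL merge (upsert)" target can be sketched as follows: insert rows with new keys, update rows whose keys already exist. SQLite's INSERT ... ON CONFLICT stands in for DB2's MERGE statement here, and the names are illustrative:

```python
# Minimal upsert sketch: key 1 already exists and is updated; key 2 is
# new and is inserted.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim (cust_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO dim VALUES (1, 'old name')")

for cust_id, name in [(1, "new name"), (2, "B")]:
    con.execute(
        """INSERT INTO dim VALUES (?, ?)
           ON CONFLICT(cust_id) DO UPDATE SET name = excluded.name""",
        (cust_id, name),
    )

print(con.execute("SELECT * FROM dim ORDER BY cust_id").fetchall())
# [(1, 'new name'), (2, 'B')]
```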
Data Flows: Transform Operators
• Select list (columns and expressions)
• Distinct (similar to a SELECT DISTINCT)
• Where condition (constraints)
• Table join (inner and outer joins supported)
• Group by (aggregations, HAVING clause)
• Order by
• Union (also INTERSECT and EXCEPT)
• Pivot and unpivot
• Key lookup
• Fact key replace
• Sequence (DB2 key generator)
• Splitter
• Custom SQL
• DB2 table function
Data Flows: Operator Properties
• Properties view for all operators: properties for operators and properties for operator ports
• Properties are duplicated in a wizard view for object-dependent operators (table/file sources and targets, data station, etc.)
• The wizard view prompts for the object definition but does not require it
• The properties view approach is the standard Eclipse interface for defining object attributes
[Screenshots: properties wizard and properties view]
Data Flows: Ports and Port Properties
• Operators have input and/or output ports
• Connections go from upstream output ports to downstream input ports
• Ports have properties (virtual table definitions)
Data Flows: Column Level Connections
• Connections may need to be made at the column level:
  - You might change your mind about a flow definition, delete a connection, or delete an upstream operator
  - You might not use all of the attributes that you defined downstream
  - You can use column-level connections to refresh or modify the new input schema
Data Flows: Variables
• Variables can be used in data flows to:
  - Defer the definition of certain properties until a later phase in the life cycle:
    - file names
    - table names
    - database schema names
    - etc.
  - Generalize a data flow
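The deferral idea can be sketched with simple placeholder substitution: the flow references a variable, and the concrete value is bound at a later phase (deployment or runtime). The variable names and `${...}` syntax are illustrative; SQW's actual variable syntax may differ:

```python
# Minimal deferred-variable sketch using string.Template.
from string import Template

flow_sql = Template("SELECT * FROM ${SCHEMA}.${SOURCE_TABLE}")

# Design time: the variables are defined but unresolved.
# Deployment/runtime: values are bound for the target environment.
prod_values = {"SCHEMA": "WAREHOUSE", "SOURCE_TABLE": "SALES_STG"}
print(flow_sql.substitute(prod_values))
# SELECT * FROM WAREHOUSE.SALES_STG
```

The same variable can then be reused across multiple operators and flows, with each environment supplying its own values.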
Data Flows: Variable Definition and Selection
• Define a variable using the Variable Manager
• Set its properties, current value, and phase (the phase determines when the value can be set during the life cycle)
• Use the same variable in multiple operators in different flows
Data Flows: Validation
• When you save or validate a data flow, any errors are identified
• Yellow exclamation marks are warnings; red X marks are serious errors
• Hover help text exists for these error conditions; just mouse over the icon
• Also check the Problems view (next to Properties) to see the errors
• Validation rules cover a variety of error conditions: missing links and properties, for example
Data Flows: Data Station Operators
• Staging points in a data flow
• Station types: persistent table, temporary table, view, or file (temporary tables and views are dropped after execution)
• Data stations with persistent tables can serve as target operators
• Useful as a recovery mechanism and as a checkpoint (what does the data set look like at this point in the flow?)
• Pass-through option: switch the data station on and off for different runs
Data Flows: Subflows
• A subflow is a predefined set of operators that you can place inside a data flow
• Useful as a plug-in for multiple versions of the same or similar data flows
• Containers or building blocks for complex flows (division of labor)
• Blue ports represent subflow inputs and outputs
Data Flows: Subflows
• Subflows consist of input ports and/or output ports and operators
• Where does the subflow fit?
  - Input ports only = subflow at the beginning of a data flow
  - Output ports only = subflow at the end of a data flow
  - Input and output ports = subflow is mid-flow
• After creating a subflow, drop it into a data flow
• Subflows can be nested
• Data flows can be saved as subflows
• DataStage jobs can be imported into data flows as subflows
Data Flows: Design Studio Execution
• Validate the flow first and troubleshoot any errors
• Generate and review the code (optional)
• Complete the Flow Execution wizard:
  - Choose or define the run profile
  - Select resources and variable values if required
  - Wait for the execution results to be displayed
• Design Studio execution is intended for testing and training purposes
• Deploy applications to the DWE Runtime for production runs, scheduling, and administration
Data Flows: Testing – Logs and Tracing
• Diagnostics tab of the Flow Execution wizard
• Log file path
• Log files can be appended or overwritten
• Tip: tracing overhead does not depend on data input size, so tracing time is negligible for large data sets
Data Flows: Complete Example
Control Flows: Definition and Simple Example
• A control flow is a container model that sequences one or more data flows and integrates other data processing rules and activities
• Data warehouse applications that you deploy to the DWE Runtime Environment depend on control flows
• You cannot deploy data flows independently; wrap them inside a control flow first
• This simple example processes two data flows in sequence; if either fails, e-mail is sent to an administrator
Control Flows: Anatomy
• Operators: define the type of activity
• Ports: define the entry and exit points of an operator
• Connectors: direct the processing flow of control between operators
Control Flows: Ports
• Entry
• On-success exit
• Unconditional exit
• On-failure exit
An unconditional connection supersedes conditional connections.
Control Flows: Ports – Start/End Operators
• Start Process: the entry point; only one Start operator per control flow
• Process On-Failure: invoked after an activity's on-failure branch, if any
• Cleanup Process: invoked after reaching the terminal point of any branch
• On-failure and cleanup processes are optional, but you may have multiple as needed
Control Flows: Operators
• SQW flow operators
  - Data flow
  - Mining flow
• Command operators
  - DB2 shell (OS scripts)
  - DB2 scripts
  - FTP
  - Executable
• Control operators
  - File wait
  - Iterator
  - End
• Email operator
• DataStage operators
  - Job sequence
  - Parallel job
Control Flows: Iterators
• Data processing loops that iterate over:
  - a series of delimited items in a file
  - a series of files in a directory
  - a fixed number of operations
• For example, a data flow can be executed multiple times inside one control flow, based on the existence of a set of different input files at runtime
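The "iterate over files in a directory" pattern can be sketched as below: run the same (stand-in) data flow once per input file found at runtime. The `run_data_flow` function is a placeholder, not an SQW API:

```python
# Minimal iterator sketch: discover input files at runtime and invoke
# the same flow once per file.
import pathlib
import tempfile

def run_data_flow(input_file: pathlib.Path) -> str:
    # Stand-in for executing a data flow against one input file.
    return f"processed {input_file.name}"

with tempfile.TemporaryDirectory() as d:
    for name in ("jan.csv", "feb.csv", "mar.csv"):
        (pathlib.Path(d) / name).write_text("stub data")

    results = [run_data_flow(f) for f in sorted(pathlib.Path(d).glob("*.csv"))]

print(results)
# ['processed feb.csv', 'processed jan.csv', 'processed mar.csv']
```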
Control Flows: Design Studio Execution
• Validate the flow first and troubleshoot any errors
• Generate and review the code (optional)
• Complete the Flow Execution wizard:
  - Choose or define the run profile
  - Select resources and variable values if required
  - Wait for the execution results to be displayed
• Design Studio execution is intended for testing and training purposes
• Deploy applications to the DWE Runtime for production runs, scheduling, and administration
• Code for control flow operators is validated and generated sequentially
• For subflows and macros, code is generated every time one is referenced in a data flow
Control Flows: Command Line Execution
• Execute a data warehouse application process through a command-line interface
• A Java program that can be invoked outside of WAS
  - For example: startSQWInstance -app <application_name> -process <process_name>
  - Embeddable inside a user application, for example as a means to integrate a third-party or customized scheduler by invoking a data warehouse process directly from the third-party scheduler application
• Examples of the command-line interface:
  - setSQWApplicationStatus: enable/disable an application
  - setSQWProcessStatus: enable/disable a process
  - reStartSQWInstance: restart an application instance
  - startSQWInstance -app app_name -process process_name: start an application process
  - getSQWProcessList -app app_name: get the list of instances of an application
  - getSQWApplicationList -file filename: get the list of applications from an application profile
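A wrapper (for instance, a third-party scheduler) might invoke the command line as sketched below. Only the argument list is built here; the `startSQWInstance` command name comes from the slide, but the application and process names, install path, and exact invocation are assumptions:

```python
# Minimal sketch of building the SQW command line for a scheduler.
def build_sqw_command(app: str, process: str) -> list:
    return ["startSQWInstance", "-app", app, "-process", process]

cmd = build_sqw_command("sales_wh", "nightly_load")
print(" ".join(cmd))
# startSQWInstance -app sales_wh -process nightly_load
# A scheduler would then run it, e.g.: subprocess.run(cmd, check=True)
```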
Control Flows: Complete Example
DataStage and DWE SQW
[Diagram repeated: sources (ERP, IMS, complex files, XML, other DB2) feeding the EDW, with DataStage and DWE SQW highlighted]
Design Studio with DataStage: Integration Points
[Diagram: the Design Studio (control flow and data flow editors, EMF metadata, code generator/optimizer) alongside the runtime (DWE Admin Console on WebSphere Application Server, DataStage Server, DB2). Integration points: import a DataStage job as a visual subflow; import a DataStage job as an opaque runtime object; export SQL to DataStage as a command operator; call DWE flows directly from the DataStage scheduler]
Integrated Tools for Dynamic Warehousing
IBM Information Server
• Seamless integration of DataStage jobs into the SQW environment
Import capabilities - Subflow
• From the DataStage Designer, export a DataStage job in XML format
• Bring the job into the Design Studio as a subflow
Import – Control Flow
• Not really an import, per se
• The ability to execute a DataStage job or sequence as a "black box" within a control flow
Export capabilities
• Deploy a data flow as a set of DataStage executables (SQL, XML, and DSX files)
• Open the data flow in the DataStage Designer as a parallel job