IBM Ascential ETL Overview:
DataStage and QualityStage
More than ever, businesses today need to understand their operations, customers, suppliers, partners, employees, and stockholders. They need to know what is happening with the business, analyze their operations, react to market conditions, and make the right decisions to drive revenue growth, increase profits, and improve productivity and efficiency.
CIOs are responding to their organizations’ strategic needs by developing IT initiatives that align corporate data with business objectives. These initiatives include:
Business intelligence
Master data management
Business transformation
Infrastructure rationalization
Risk and compliance
IBM WebSphere Information Integration platform enables businesses to perform five key integration functions:
• Connect to any data or content, wherever it resides
• Understand and analyze information, including relationships and lineage
• Cleanse information to ensure its quality and consistency
• Transform information to provide enrichment and tailoring for its specific purposes
• Federate information to make it accessible to people, processes and applications
Data Analysis: Define, annotate, and report on fields of business data.
Software: Profile stage in QualityStage

Data Quality:
• Standardize source data fields
• Match records across or within data sources, remove duplicate data
• Survive records from the best information across sources
Software: QualityStage

Data Transformation & Movement: Move data and transform it to meet the requirements of its target systems.
Software: DataStage

Federation:
• Integrate data and content
• Provide views as if from a single source while maintaining source integrity
Software: N/A (not used at NCEN)
This presentation will deal with ETL QualityStage and DataStage.

QualityStage
QualityStage is used to cleanse and enrich data to meet business needs and data quality management standards.
Data preparation (often referred to as data cleansing) is critical to the success of an integration project. QualityStage provides a set of integrated modules for accomplishing data reengineering tasks, such as:
• Investigating
• Standardizing
• Designing and running matches
• Determining what data records survive
= data cleansing
QualityStage: Main QS stages used in the BRM project:
Investigate – gives you complete visibility into the actual condition of data (not used in the BRM project because the users really know their data)
Standardize – allows you to reformat data from multiple systems to ensure that each data type has the correct and consistent content and format
Match – helps to ensure data integrity by linking records from one or more data sources that correspond to the same real-world entity. Matching can be used to identify duplicate entities resulting from data entry variations or account-oriented business practices
Survive – helps to ensure that the best available data survives and is correctly prepared for the target destination
QualityStage: Investigate
Word Investigation parses freeform fields into individual tokens, which are analyzed to create patterns. In addition, Word Investigation provides frequency counts on the tokens.
For example, to create the patterns in address data:
• Word Investigation uses a set of rules for classifying personal names, business names and addresses.
• Word Investigation provides prebuilt rule sets for investigating patterns on names and postal addresses for a number of different countries.

For the United States, the address rule sets include:
• USPREP (parses name, address and area if data not previously formatted)
• USNAME (for individual and organization names)
• USADDR (for street and mailing addresses)
• USAREA (for city, state, ZIP code and so on)
Example: The text field "123 St. Virginia St." would be analyzed in the following way:

Field parsing breaks the address into the individual tokens "123", "St.", "Virginia" and "St."

Lexical analysis determines the business significance of each piece:
123 = Number
St. = Street type
Virginia = Alpha
St. = Street type

Context analysis identifies the variations in data structure and content of "123 St. Virginia St.":
123 = House number
St. Virginia = Street address
St. = Street type
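As a rough illustration of the parsing and lexical-analysis steps, the sketch below labels each token of the example field. The classification list and category names here are simplified stand-ins invented for illustration, not the actual QualityStage classification tables or pattern codes.

```python
# Toy Word Investigation: parse a freeform field into tokens and
# classify each one. STREET_TYPES is a made-up classification list.

STREET_TYPES = {"ST.", "ST", "AVE", "BLVD", "RD"}  # hypothetical list

def classify(token):
    if token.isdigit():
        return "Number"
    if token.upper() in STREET_TYPES:
        return "Street type"
    return "Alpha"

def investigate(text):
    tokens = text.split()                       # field parsing
    return [(t, classify(t)) for t in tokens]   # lexical analysis

print(investigate("123 St. Virginia St."))
# -> [('123', 'Number'), ('St.', 'Street type'),
#     ('Virginia', 'Alpha'), ('St.', 'Street type')]
```

A real rule set would then apply context analysis to the resulting pattern to decide, for example, that the first "St." is part of the street name rather than a street type.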
QualityStage: Standardize
The Standardize stage allows you to reformat data from multiple systems to ensure that each data type has the correct and consistent content and format.
The list below shows some of the more commonly used Rule Sets:
• The USNAME rule set is used to standardize First Name, Middle Name, Last Name
• The USADDR rule set is used to standardize Address data
• The USAREA rule set is used to standardize City, State, Zip Code
• The VTAXID rule set is used to validate Social Security Number
• The VEMAIL rule set is used to validate Email Address
• The VPHONE rule set is used to validate Work Phone Number
Standardization is used to invoke specific standardization Rule Sets and standardize one or more fields using that Rule Set. For example, a Rule Set can be used so that "Boulevard" will always be "Blvd", not "Boulevard", "Blv.", "Boulev", or some other variation.
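The idea behind such a rule can be sketched in a few lines of Python. The variant table below is invented for illustration and is far simpler than a real QualityStage Rule Set, which combines classification tables, patterns, and dictionaries:

```python
# Toy standardization rule: map known street-type variants to one
# canonical form. CANONICAL is an invented variant table.

CANONICAL = {
    "BOULEVARD": "Blvd", "BLV": "Blvd", "BOULEV": "Blvd", "BLVD": "Blvd",
    "STREET": "St", "STR": "St", "ST": "St",
}

def standardize(field):
    # drop periods, then replace each word that has a canonical form
    words = field.replace(".", "").split()
    return " ".join(CANONICAL.get(w.upper(), w) for w in words)

print(standardize("100 Main Boulev"))   # -> "100 Main Blvd"
```

The same lookup-and-replace pattern applies to name and area fields; only the tables differ.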
QualityStage: Match
Data matching is used to find records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) regardless of the availability of a predetermined key.

The QualityStage Match stage basically consists of two steps:
1. Blocking
2. Matching
ETL QualityStage: Matching Stage

A simplified explanation of the Matching Stage:
XA = master record (during the first pass, this was the first record found to match with another record)
DA = duplicates
CP = clerical procedure (records with a weighting within a set cutoff range)
RA = residuals (those records that remain isolated)
Operations in the Matching module:
1. Unduplication (group records into sets having similar attributes)
2. Processing Files

(Match specification elements: Match Fields, Suspect Match Values by Match Pass, Vartypes, Cutoff Weights)
Unduplication: Weights
A simplified example
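The block-then-weight idea can be sketched as follows. The sample records, field weights, and cutoffs below are invented for illustration; QualityStage derives its weights from probabilistic match statistics rather than fixed constants:

```python
# Toy blocking + weighted matching. Pairs are compared only within a
# block (same ZIP code); agreement weights are summed and compared
# against cutoffs to classify each pair as DA, CP, or RA.
from collections import defaultdict

records = [
    {"id": 1, "zip": "22101", "name": "JOHN SMITH", "phone": "5551234"},
    {"id": 2, "zip": "22101", "name": "JON SMITH",  "phone": "5551234"},
    {"id": 3, "zip": "90210", "name": "JANE DOE",   "phone": "5559876"},
]

WEIGHTS = {"name": 5, "phone": 10}   # hypothetical agreement weights
MATCH_CUTOFF = 10                    # >= this -> duplicate (DA)
CLERICAL_CUTOFF = 5                  # in between -> clerical review (CP)

# Blocking: group records by ZIP so we never compare across blocks.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)

def score(a, b):
    return sum(w for f, w in WEIGHTS.items() if a[f] == b[f])

for block in blocks.values():
    for i, a in enumerate(block):
        for b in block[i + 1:]:
            s = score(a, b)
            status = ("DA" if s >= MATCH_CUTOFF
                      else "CP" if s >= CLERICAL_CUTOFF else "RA")
            print(a["id"], b["id"], s, status)   # -> 1 2 10 DA
```

Record 3 sits alone in its block, so it is never compared and remains a residual; records 1 and 2 agree on phone but not name, and the phone weight alone reaches the match cutoff.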
QualityStage: Survive
Survivorship is used to create a 'best record' from all available information about an entity (such as a person, location, material, etc.). Survivorship and formatting ensure that the best available data survives and is correctly prepared for the target destination. Using the rules setup screen, it implements business and mapping rules, creating the necessary output structures for the target application and identifying fields that do not conform to load standards.
The Survive stage does the following:
• Supplies missing values in one record with values from other records on the same entity
• Populates missing values in one record with values from corresponding records that have been identified as a group in the matching stage
• Enriches existing data with external data
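A minimal sketch of building a 'best record' from a matched group, assuming an invented survival rule (prefer non-empty values from the most recently updated record); real QualityStage survive rules are configured per field in the rules setup screen:

```python
# Toy survivorship: merge a matched group into one best record.
# The "most recent non-empty value wins" rule is an assumption.

group = [  # records already linked by the Match stage
    {"name": "J. SMITH",   "email": "",               "phone": "5551234", "updated": 2005},
    {"name": "JOHN SMITH", "email": "js@example.com", "phone": "",        "updated": 2007},
]

def survive(records):
    best = {}
    for field in records[0]:
        if field == "updated":
            continue
        # keep the non-empty value from the most recently updated record
        candidates = [r for r in records if r[field]]
        candidates.sort(key=lambda r: r["updated"], reverse=True)
        best[field] = candidates[0][field] if candidates else ""
    return best

print(survive(group))
# -> {'name': 'JOHN SMITH', 'email': 'js@example.com', 'phone': '5551234'}
```

Note how each field of the surviving record can come from a different source record: the phone survives from the older record because the newer one left it blank.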
DataStage = data transformation

DataStage
In its simplest form, DataStage performs data transformation and movement from source systems to target systems in batch and in real time.
The data sources may include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications and message queues.
The DataStage client components are:
• DataStage Administrator
• DataStage Manager
• DataStage Designer
• DataStage Director
DataStage: Administrator

Use DataStage Administrator to:
• Specify general server defaults
• Add and delete projects
• Set project properties
• Access the DataStage Repository by command interface
DataStage: Manager
DataStage Manager is the primary interface to the DataStage repository.
In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. It also allows us to move or copy ETL jobs from one project to another.
DataStage: Designer
Use DataStage Designer to:
• Specify how the data is extracted
• Specify data transformations
• Decode (denormalize) data going into the data mart using referenced lookups
• Aggregate data
• Split data into multiple outputs on the basis of defined constraints
DataStage: Director
Use DataStage Director to run, schedule, and monitor your DataStage jobs. You can also gather statistics as a job runs, and view its logs for debugging purposes.
DataStage: Getting Started
• Set up a project – Before you can create any DataStage jobs, you must set up your project by entering information about your data.
• Create a job – When a DataStage project is installed, it is empty and you must create the jobs you need in DataStage Designer.
• Define Table Definitions
• Develop the job – Jobs are designed and developed using the Designer. Each data source, the data warehouse, and each processing step is represented by a stage in the job design. The stages are linked together to show the flow of data.
DataStage Designer: Developing a job
DataStage Designer: Input Stage
DataStage Designer: Transformer Stage
The Transformer stage performs any data conversion required before the data is output to another stage in the job design.
After you are done, compile and run the job.
An example: Preventing the header row from inserting into MDM_Contact and MDM_Broker

T10 takes .txt files from the Pre-event folder and transforms them into rows.
Straight_moves moves the rows into the stg_file_contact table, stg_file_broker table, or the reject file:
• If a row says "lead source", it will go to the reject file (constraint).
• If it does not say "lead source", the entire row is evaluated to determine whether it will go to the contact or broker table (derivation).
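The constraint/derivation logic described in this example can be sketched as follows. The pipe-delimited row layout and the broker test on the second field are assumptions for illustration; the actual file format and Transformer expressions in the job are not shown in this deck:

```python
# Toy Transformer-stage routing: a constraint rejects the header row,
# and a derivation picks the target staging table for data rows.

def route(row):
    fields = row.split("|")
    if fields[0].strip().lower() == "lead source":
        return "reject"                      # constraint: header row
    # derivation: inspect the row to pick the target table
    return "stg_file_broker" if fields[1] == "BROKER" else "stg_file_contact"

rows = [
    "lead source|type|name",
    "WEB|BROKER|ACME REALTY",
    "WEB|CONTACT|JANE DOE",
]
for r in rows:
    print(route(r))   # -> reject, stg_file_broker, stg_file_contact
```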
Questions?

Thank you for attending