Lab 0 Informatica Data Quality setup
DATA QUALITY BOOTCAMP
DATA QUALITY 9.1 Training
IDQ 9.1 Labs
Lab 1 - Content Management Service
Lab 2 - New Reference Table Capabilities
Lab 3 - Content Sets
Lab 4 - Tags
Lab 5 - Match Enhancements
Lab 6 - New Exception Transform
Lab 7 - Data Quality for MS Excel
Lab 8 - Profiling Labs
Lab 1 - Content Management Service
Create and Configure CMS
Objective: Configure AddressDoctor options and check AV reference file status from Developer
Steps
Open Administrator Console
Select Action/New/Content Management Service
Follow Wizard to create and start the CMS
Open CMS, processes tab
Edit AV Options
AV Licence: S0PCF4MN94L7ZXEZ635NCSM90NZKR0NJUTWA
Set No Pre-Load to ALL for all types
Set AV Reference data path to: C:\Informatica\9.1.0\services\DQContent\INFA_Content\av\default
Recycle CMS Service
Recycle DIS Service
Open Developer
Select Window / Preferences
Select Content Status
Check Status is displayed correctly (expected view below)
Lab 2 - New Reference Table Capabilities
Create Managed Reference Table from database
Objective: Create new reference table using a database source
Steps
Open Analyst Tool
Select Create New Reference Table
Select Connect to a Relational Table
Select DQ_Tables Connection
Select fname table
Select Column1 as valid value
Save as fname in Your Project
Create Unmanaged Reference Table from database
Objective: Create new unmanaged reference table
Steps
Open Analyst Tool
Select Create New Reference Table
Select Connect to a Relational Table
Make sure Unmanaged Table is ticked
Select DQ_Tables Connection
Select us_states table
Select Column1 as valid value
Save as us_states in Your Project
End of Exercise
Lab 3 - Content Sets
Create and configure new Content Set
Objective: Create new content set and content set expressions
Steps
Open Developer
Select File / New / Content Set
Create new Content Set called ContentSet_91 in Your project
Open your content set
Add a new Character Set:
Name: Char
Label : C
Range: a-z and A-Z
Add a new Regular Expression:
Name: num
Number of Outputs: 1
RegEx: ^[0-9]+$
Add a new Token Set (RegEx):
Name: date
Label: date
RegEx: ^\d{1,2}\/\d{1,2}\/\d{4}$
Description: matches dates of the form XX/XX/YYYY, where XX can be 1 or 2 digits long and YYYY is always 4 digits long (a quick sanity check of this pattern and the num expression is sketched after these steps).
Save Content Set
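Both expressions can be sanity-checked outside Developer before saving the content set. A minimal sketch in Python; the sample values are made up for illustration, and Developer applies the same patterns during parsing and labelling:

    import re

    # Patterns as defined in ContentSet_91
    num_pattern = re.compile(r"^[0-9]+$")                  # "num" regular expression
    date_pattern = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")  # "date" token set

    # Made-up sample values
    for value in ["123", "12a", "1/1/2011", "01/12/2011", "1/1/11"]:
        print(value,
              "num" if num_pattern.match(value) else "-",
              "date" if date_pattern.match(value) else "-")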
Use a Content Set - Parse, Cleanse and Standardize Data
Objective: Prepare data source for upload to Warehouse and matching scenarios
Steps
Create New Mapping: m_process_customer_data
Add customer flat file source from c:\DQ_DATA directory
Add Parser (Token Parser)
Add Input Port contact_name from source
Create new strategy
Name: parse_names
Operation 1:
Operation: Parse Using Reference Table
Name: parse_fname
Reference Table: fnames (Enablement_91 project)
Output: fname, string, 25
Operation 2:
Operation: Parse Using Reference Table
Name: parse_sname
Reference Table: usa_surnames_infa (Informatica_DQ_Content/Dictionaries/North America/USA)
Output: sname, string, 25
Add Input port address1 from source
Create new strategy
Name: parse_housenum
Operation 1:
Operation: Parse Using Token Set
Select Regular Expression
Choose RegEx num from ContentSet_91
Create new output house_num
Run data viewer and examine your results
Add Labeler
Add Input Port address3
Create New Strategy
Name: label_state
Mode: Token
Operation: Label with Reference Table
Reference Table: us_states (Your project)
Label: state
Add Labeler (or use existing one)
Add Input Port cust_start_date from source
Create new strategy
Name: lbl_date
Mode: Token
Operation 1:
Operation: Label Tokens with Token Set
Name: lbl_date
Select Token Set date from ContentSet_91
Add Input Port currency from source
Create new strategy
Name: lbl_currency
Mode: Token
Operation 1:
Operation: Label Tokens with Reference Table
Reference Table: Informatica_DQ_Content/dictionaries/general/currency_codes_infa
Name: lbl_currency
Run the data viewer and examine your results
Should look something like this:
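Conceptually, the Parse Using Reference Table operations above split contact_name by looking each token up in a dictionary. A rough Python illustration of the parse_names strategy, using tiny made-up stand-ins for the fnames and usa_surnames_infa tables (the real reference tables are far larger):

    # Toy stand-ins for the first-name and surname reference tables (illustration only)
    first_names = {"JOHN", "MARY", "ANNE"}
    surnames = {"SMITH", "JONES", "MURPHY"}

    def parse_name(contact_name: str):
        """Split a name into (fname, sname, overflow) by dictionary lookup,
        roughly what the Token Parser's reference-table operations do."""
        fname, sname, overflow = "", "", []
        for token in contact_name.upper().split():
            if not fname and token in first_names:
                fname = token
            elif not sname and token in surnames:
                sname = token
            else:
                overflow.append(token)   # unparsed tokens
        return fname, sname, " ".join(overflow)

    print(parse_name("John Smith"))      # ('JOHN', 'SMITH', '')
    print(parse_name("Dr Mary Murphy"))  # ('MARY', 'MURPHY', 'DR')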
Lab 4 - Tags
Create and associate new tags - Developer
Objective: Create new tags and associate to objects in Developer
Create Tag Steps
Open Developer
Open Window / Preferences
Select Tags
View Out of the Box Tags (These will appear when you install the 9.1 accelerators, which this image does not have)
Create new tags:
Customer
Product
Content
Associate Tag Steps
Open Developer
Apply Tags to Data Sources
Open Source
Navigate to Tags View
Select Edit
Apply Tag
Apply Tags to content set expressions
Open Content Set
Navigate to Tags View
Select Edit
Apply Content Tag to all elements
Create and associate new tags - Analyst
Objective: Create new tags and associate to objects in Analyst
Create Tag Steps
Open Analyst
Select Actions / Show Tags
Create new tags:
Order
Address
Associate Tag Steps
Open Profile_order
(You may not have profiled the order table yet. Create a data object using the flat file order in the c:\DQ_DATA directory and profile it, columns only.)
Apply Tags to data columns
Show Tags view
Select Address related columns
Apply Address Tag
Select Order related columns
Apply Order Tag
Select Customer name related columns
Apply Customer Tag
Go to project view
Select RTM fnames
Apply Customer and Content Tags
Lab 5 - Match Enhancements
Pre-req for image
Copy 3 .ysp files from C:\Informatica\9.1.0\services\DQContent\INFA_Content\identity\
To a newly created folder called default at
C:\Informatica\9.1.0\services\DQContent\INFA_Content\identity\default
Key Gen and Match Analysis
Objective: Identify potential duplicates
Steps
Create New Mapping: m_match_customer
Add c:\DQ_Data\aml_demo_data source
Add Key Generator
Use String strategy on iso_ctry_code
Right-click the Key Generator and select Analyse Detail from the menu
Review the following information:
Estimated processing time
Groups above the recommended threshold
Edit the desired throughput value and observe how estimated processing time changes
Edit min and max group size values
Select groups above the threshold from the dropdown list and drilldown to the record level
Re-configure Key Generator to zip_or_postcode port
Run GroupKey Analysis again and observe the results
Add Match transform and configure as follows
Field Matching (Single Source)
Edit Distance on contact_name
Threshold 0.6
Select Runtime Analysis by MatchType from the right click menu
Review results
Select Output Analysis of Clusters from the right click menu
Review results (There is a bug in this Beta release so you may get funny results)
Repeat the above steps for a Match transform configured for Identity Matching
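Under the hood, the Key Generator limits comparisons to records that share a group key, and the Edit Distance strategy then scores each pair inside a group. A rough Python sketch of that flow with made-up records; the 0.6 threshold matches the lab, but the score normalisation used here is just one common convention and Developer's exact scoring may differ:

    from itertools import combinations
    from collections import defaultdict

    def edit_distance(a: str, b: str) -> int:
        """Standard Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def similarity(a: str, b: str) -> float:
        """Score in 0..1; 1.0 means identical (one common normalisation)."""
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a, b) / max(len(a), len(b))

    # Hypothetical records: (iso_ctry_code, contact_name)
    records = [("US", "John Smith"), ("US", "Jon Smith"),
               ("US", "Mary Jones"), ("IE", "Sean Murphy")]

    # Key Generator (String strategy): group on iso_ctry_code so only records
    # in the same group are compared
    groups = defaultdict(list)
    for ctry, name in records:
        groups[ctry].append(name)

    # Match transform (Field Matching, Edit Distance, threshold 0.6)
    for ctry, names in groups.items():
        for a, b in combinations(names, 2):
            score = similarity(a, b)
            if score >= 0.6:
                print(f"{ctry}: '{a}' ~ '{b}' score={score:.2f}")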
Lab 6 - New Exception Transform
Objective: Identify and manually correct exception records
Think of this as further down in a mapping where you have already run data cleansing and address validation. You have also run a data quality check on the phone field. Now that you've done that, you need to decide which records pass and which need manual intervention.
Steps
Create New Mapping: m_exception_records
Add c:\DQ_data\cleansed_customer source
Add Decision
Assign a score of 60 to records where AddressStatus is Incomplete Address or Invalid Address Line, or where PhoneStatus is Incomplete Phone
Assign a score of 90 to all remaining records
if AddressStatus = 'Incomplete Address' or AddressStatus = 'Invalid Address Line' or PhoneStatus = 'Incomplete Phone'
then score := 60
else score := 90
endif
Add Exception transform and configure as follows
Bad Records Exception
Table BAD RECORDS in Staging DB
Connect data ports
Add AddressStatus and PhoneStatus ports to Labels input
Connect the score port to Inputs >> Control >> Score.
Records with a score between 40 and 90 to be reviewed manually
Send good records to standard output and bad records to Bad Records table (see the routing sketch after these steps)
Map AddressStatus and PhoneStatus ports to respective issues in the priority tab
Run the mapping (Data Viewer)
Run the data viewer on the Exception transform
In Analyst, add the newly created table to the project view
Review available filters
Select Edit mode
Correct a number of records and select Save All Corrected Records
Open the audit trail and review the changes made
Hover over the new value to see the old value
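Taken together, the Decision and Exception transforms implement a simple routing rule: score each record, send high-scoring records to standard output, and send records in the 40-90 review band to the Bad Records table for manual correction in Analyst. A rough Python equivalent of that logic (records and field values are made up; the exact boundary handling in the Exception transform may differ):

    # Hypothetical cleansed_customer records
    records = [
        {"name": "John Smith", "AddressStatus": "Valid Address", "PhoneStatus": "Valid Phone"},
        {"name": "Mary Jones", "AddressStatus": "Incomplete Address", "PhoneStatus": "Valid Phone"},
    ]

    def score(rec):
        """Decision transform logic from the lab."""
        if rec["AddressStatus"] in ("Incomplete Address", "Invalid Address Line") \
                or rec["PhoneStatus"] == "Incomplete Phone":
            return 60
        return 90

    good, review = [], []
    for rec in records:
        s = score(rec)
        # Exception transform: scores in the 40-90 review band go to the BAD RECORDS
        # table for manual review; the good score of 90 falls outside the band here
        (review if 40 <= s < 90 else good).append({**rec, "score": s})

    print("good:", good)
    print("review:", review)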
Lab 7 - Data Quality for MS Excel
Install DQ for Excel
Objective: Perform base install of DQ for Excel
Steps
Extract lwp3.zip (from lab files provided) to Desktop
Close Excel
Run setup.exe
Note: this will connect to the internet to download the base Excel DLLs required to run the Add-In
Once complete, open Excel
Check Informatica Ribbon is available
Use DQ for Excel
Objective: Add new service to DQ for Excel and use with sample data
Steps
Export WSDL from service created in DQ/WS lab
Open Excel
Select Informatica Ribbon and Add Service
Point to WSDL file extracted from first step
Open customer.xlsx (see Desktop on 9.1 TTT VM, folder DQ_data_enablement91)
Use imported service to parse Name column into First and Last names
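The WSDL imported into Excel describes an ordinary SOAP service, so you can also exercise it from a script if you want to verify it before wiring it into the Add-In. A hedged sketch using the zeep library; the WSDL filename and the operation/parameter names below are placeholders, so check which operations your DQ/WS lab service actually exposes (for example by running python -m zeep your.wsdl):

    from zeep import Client

    # Placeholder path to the WSDL exported in the first step
    client = Client("exported_service.wsdl")

    # Hypothetical operation and parameter names; inspect the WSDL for the real ones
    result = client.service.ParseName(Name="John Smith")
    print(result)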
Lab 8 - Profiling Labs
Profiling
1. Add 3 Data Objects (Location: c:\DQ_DATA). Tool matters (make sure you click the … because there are embedded commas in the data)
a. Customer
b. Orders
c. Product
2. Profile Customer Columns only
3. Delete profile
4. Profile All the tables at once
a. 2 ways
i. Select all the data objects and profile
ii. Profile one, then add the other two
1. Add a Prefix of DW_ to the objects
5. Profile Customer Column, Primary Key and Dependency
a. Take all the defaults (but keep hitting Next, not Finish, to see them)
Remember to select the columns for PK and Functional Dependency
b. View PK Results
c. Select Cust_number and verify
i. What happened to the display?
d. View Violations
e. Select cust_Number 15952672 and Drill down
i. What is the difference between the records?
f. View Functional Dependencies
i. Why are there blank determinants?
g. Select a column and verify
i. What happened in the Display?
6. Profile Orders All three (Column, Primary Key and Dependency) and take defaults
a. View column profile
i. What is the Key to this table?
b. View Primary Key inference
i. What is the key?
c. View the Functional Dependencies
i. Can you identify any potential sub-tables?
7. Delete the orders profile
8. Re-profile orders. This time override and change the default options.
Hint: Be careful when changing these options, otherwise you will be here till Saturday
a. Primary Key Minimum Percent = 100
b. Dependency - Minimum Percent = 100
i. View results
ii. Sort on the determinant column
iii. Can you identify a key?
iv. Can you identify a potential sub-table?
9. Go back and modify the profile definition (PD) to change the minimum to 75
i. What does this show you?
10. Delete the profile
11. Profile orders dependency only
a. Exclude all columns from determinants except sales_id
b. Verify all fields that show 100%
12. Go back and modify the profile definition to profile all three tasks
a. Did it work?
13. Profile orders all three tasks
a. Verify the primary key
b. Look at dependencies
i. Why are form, ingredient_list, on_hand and segment not determined by the primary key?
ii. How can I fix segment and form to show they are determined by the primary key?
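Primary-key and dependency inference come down to two percentages: how unique a candidate key column is, and how consistently a determinant column maps to a single value of a dependent column. The Minimum Percent options above are thresholds on these figures. A small pandas sketch of both checks, using a made-up stand-in for the orders data (Developer's exact metric definitions may differ slightly):

    import pandas as pd

    # Made-up stand-in for the orders flat file in c:\DQ_DATA
    orders = pd.DataFrame({
        "sales_id":    [1, 2, 3, 4],
        "cust_number": [10, 10, 11, 12],
        "product_id":  ["A", "B", "A", "C"],
        "segment":     ["Retail", "Retail", "Trade", "Trade"],
    })

    def pk_percent(df, col):
        """Percentage of rows with a unique value in col (primary key inference)."""
        return 100.0 * df[col].nunique() / len(df)

    def fd_percent(df, determinant, dependent):
        """Percentage of determinant values that map to exactly one dependent value
        (functional dependency inference)."""
        counts = df.groupby(determinant)[dependent].nunique()
        return 100.0 * (counts == 1).sum() / len(counts)

    print(pk_percent(orders, "sales_id"))                # 100.0 -> key candidate
    print(pk_percent(orders, "cust_number"))             # 75.0
    print(fd_percent(orders, "cust_number", "segment"))  # 100.0 -> cust_number determines segment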
Filters
1. Delete Customer profile
2. Profile customer adding filters
a. Address3 = NY
b. Address3 = CA
c. Address3 = NY and address2 = NEW YORK
3. Run each profile, view results, modify Profile definition and run the next one.
Drilldowns
1. Open the Customer Profile in the Analyst tool
2. Remove any active filters and re-run the profile
3. Go to address3 and drill down on NY
4. Edit the drill down filter and add address3 = NEW YORK (yes case matters)
5. Add another field Zip_or_postcode < 10020
6. Add to the filter, iso_ctry_code is not null or != USA or != U.S.A.
Modeling
1. Create a Profile model called Whatever with orders, customer and product
a. Select the customer object, right click and data object profile
b. Profile all 3 steps setting options you think are appropriate
c. View results
d. In Primary Key inference, right click and add cust_number to model
2. Go back to the default view and create and run a data object profile for orders
a. View results
b. Add item+order to the model
3. Go back to the default view and create and run a data object profile for product
a. Make sure you verify all your keys
b. Add product_id to the model
4. Select customer and orders and profile foreign keys
a. Verify and add relationship to the model
b. Re-select the relationship and view the Venn diagram
c. Double click on the non-overlapping orders.
i. See orders without a valid customer ID
ii. Double click on customer and see all the deadbeat customers.
5. Go back to the default view and do the same for orders and product changing the inference options (Trim values and case sensitivity)
a. Verify the relationship (if it makes sense)
b. Re-select the relationship and view the Venn diagram.
i. Find the products that don't have open orders.
ii. Find the most ordered products.
6. FINALLY (well almost), select customer and do a join profile
a. Add orders
b. Add a join on cust_number and GCLOC
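The foreign-key and join profiles are essentially set-overlap checks between the key columns of two sources, and the Venn diagram simply visualises the three regions. A pandas sketch of the customer/orders case with made-up keys:

    import pandas as pd

    # Made-up stand-ins for the customer and orders sources
    customer = pd.DataFrame({"cust_number": [10, 11, 12, 13]})
    orders   = pd.DataFrame({"cust_number": [10, 10, 11, 99]})

    cust_keys  = set(customer["cust_number"])
    order_keys = set(orders["cust_number"])

    print("orders without a valid customer:", order_keys - cust_keys)  # {99}
    print("customers with no orders:", cust_keys - order_keys)         # the 'deadbeat' customers
    print("overlap:", cust_keys & order_keys)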
Generating a Mapping from a Profile
1. Open the Customer Profile
2. Add an OOTB rule to validate cust_start_date
a. You may want to copy it and change the date format if you want it to actually work correctly.
3. Add an OOTB rule to validate Last_Order_Date
a. See above
4. Create a rule to validate iso_ctry_code (US is the only valid value)
a. IIF(iso_ctry_code = 'US', 'Valid', 'Invalid') (a Python equivalent is sketched after this list)
5. Add an OOTB rule to remove punctuation from contact_name
6. Add more as appropriate or stop to get out early.
7. Run the profile (anything interesting?)
8. Right click on profile and generate a mapping
9. View the mapping and see the results and behavior.
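The rules added above are row-level expressions that the generated mapping applies to every record. A rough Python equivalent of two of them: the iso_ctry_code check from step 4, and a date-format check standing in for the OOTB date rules (the DD/MM/YYYY format here is an assumption; adjust it to whatever format the rule actually expects):

    from datetime import datetime

    def validate_ctry(code):
        """Step 4 rule: IIF(iso_ctry_code = 'US', 'Valid', 'Invalid')."""
        return "Valid" if code == "US" else "Invalid"

    def validate_date(value, fmt="%d/%m/%Y"):
        """Date rule sketch: valid only if the value parses in the expected format."""
        try:
            datetime.strptime(value, fmt)
            return "Valid"
        except (ValueError, TypeError):
            return "Invalid"

    # Hypothetical rows: (iso_ctry_code, cust_start_date)
    for row in [("US", "12/01/2010"), ("USA", "2010-01-12")]:
        print(row, validate_ctry(row[0]), validate_date(row[1]))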