
Lab 0 Informatica Data Quality setup

DATA QUALITY BOOTCAMP

DATA QUALITY 9.1 Training

IDQ 9.1 Labs

Lab 1 - Content Management Service

Lab 2 - New Reference Table Capabilities

Lab 3 - Content Sets

Lab 4 - Tags

Lab 5 - Match Enhancements

Lab 6 - New Exception Transform

Lab 7 - Data Quality for MS Excel

Lab 8 - Profiling Labs

Lab 1 - Content Management Service

Create and Configure CMS

Objective: Configure AddressDoctor options and check AV reference file status from Developer

Steps

Open Administrator Console

Select Action/New/Content Management Service

Follow Wizard to create and start the CMS

Open CMS, processes tab

Edit AV Options

AV Licence: S0PCF4MN94L7ZXEZ635NCSM90NZKR0NJUTWA

Set No Pre-Load to ALL for all types

Set AV Reference data path to: C:\Informatica\9.1.0\services\DQContent\INFA_Content\av\default

Recycle CMS Service

Recycle DIS Service

Open Developer

Select Window / Preferences

Select Content Status

Check Status is displayed correctly (expected view below)

Lab 2 - New Reference Table Capabilities

Create Managed Reference Table from database

Objective: Create new reference table using a database source

Steps

Open Analyst Tool

Select Create New Reference Table

Select Connect to a Relational Table

Select DQ_Tables Connection

Select fname table

Select Column1 as valid value

Save as fname in Your Project

Create Unmanaged Reference Table from database

Objective: Create new unmanaged reference table

Steps

Open Analyst Tool

Select Create New Reference Table

Select Connect to a Relational Table

Make sure Unmanaged Table is ticked

Select DQ_Tables Connection

Select us_states table

Select Column1 as valid value

Save as us_states in Your Project

End of Exercise

Lab 3 - Content Sets

Create and configure new Content Set

Objective: Create new content set and content set expressions

Steps

Open Developer

Select File / New / Content Set

Create new Content Set called ContentSet_91 in Your project

Open your content set

Add a new Character Set:

Name: Char

Label : C

Range: a-z and A-Z

Add a new Regular Expression:

Name: num

Number of Outputs: 1

RegEx: ^[0-9]+$

Add a new Token Set (RegEx):

Name: date

Label: date

RegEx: ^\d{1,2}\/\d{1,2}\/\d{4}$

Description: matches dates of the form XX/XX/YYYY where XX can be 1 or 2 digits long and YYYY is always 4 digits long.

Save Content Set
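The two expressions above can be tried outside the tool before they are used in a mapping. The sketch below is only an illustration in Python: the patterns are copied from the steps above, the helper names is_num and is_date are hypothetical, and it assumes the Content Set regex engine treats these simple patterns the same way Python's re module does.

```python
import re

# Patterns copied from the Content Set definitions above.
NUM = re.compile(r"^[0-9]+$")
DATE = re.compile(r"^\d{1,2}\/\d{1,2}\/\d{4}$")


def is_num(token: str) -> bool:
    """True when the whole token is made of digits only."""
    return NUM.match(token) is not None


def is_date(token: str) -> bool:
    """True when the token has the shape XX/XX/YYYY (format check only)."""
    return DATE.match(token) is not None


if __name__ == "__main__":
    # Hypothetical sample tokens, not taken from the lab data.
    for token in ["123", "12A", "1/2/2011", "01/02/11", "13/45/2011"]:
        print(token, "num:", is_num(token), "date:", is_date(token))
```

Note that the date pattern checks the shape only, so a value such as 13/45/2011 still passes. A calendar check would need a separate rule.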

Use a Content Set

Parse, Cleanse and Standardize Data

Objective: Prepare data source for upload to Warehouse and matching scenarios

Steps

Create New Mapping: m_process_customer_data

Add customer flat file source from c:\DQ_DATA directory

Add Parser: Token Parser

Add Input Port contact_name from source

Create new strategy

Name: parse_names

Operation 1:

Operation: Parse Using Reference Table

Name: parse_fname

Reference Table: fnames (Enablement_91 project)

Output: fname, string, 25

Operation 2:

Operation: Parse Using Reference Table

Name: parse_sname

Reference Table: usa_surnames_infa (Informatica_DQ_Content/Dictionaries/North America/USA)

Output: sname, string, 25

Add Input port address1 from source

Create new strategy

Name: parse_housenum

Operation 1:

Operation: Parse Using Token Set

Select Regular Expression

Choose RegEx num from ContentSet_91

Create new output house_num

Run data viewer and examine your results

Add Labeler

Add Input Port address3

Create New Strategy

Name: label_state

Mode: Token

Operation: Label with Reference Table

Reference Table: us_states (Your project)

Label: state

Add Labeler (or use existing one)

Add Input Port cust_start_date from source

Create new strategy

Name: lbl_date

Mode: Token

Operation 1:

Operation: Label Tokens with Token Set

Name: lbl_date

Select Token Set date from ContentSet_91

Add Input Port currency from source

Create new strategy

Name: lbl_currency

Mode: Token

Operation 1:

Operation: Label Tokens with reference table

Informatica_DQ_Content/dictionaries/general/currency_codes_infa

Name: lbl_currency

Run the data viewer and examine your results

Should look something like this:

Lab 4 - Tags

Create and associate new tags - Developer

Objective: Create new tags and associate to objects in Developer

Create Tag Steps

Open Developer

Open Window / Preferences

Select Tags

View Out of the Box Tags (These will appear when you install 9.1 accelerators, which this image does not have)

Create new tags:

Customer

Product

Content

Associate Tag Steps

Open Developer

Apply Tags to Data Sources

Open Source

Navigate to Tags View

Select Edit

Apply Tag

Apply Tags to content set expressions

Open Content Set

Navigate to Tags View

Select Edit

Apply Content Tag to all elements

Create and associate new tags - Analyst

Objective: Create new tags and associate to objects in Analyst

Create Tag Steps

Open Analyst

Select Actions / Show Tags

Create new tag:

Order

Address

Associate Tag Steps

Open Profile_order

(You probably have not already profiled the order table. Create a data object using the flat file order in the c:\DQ_DATA directory and profile it, columns only.)

Apply Tags to data columns

Show Tags view

Select Address related columns

Apply Address Tag

Select Order related columns

Apply Order Tag

Select Customer name related columns

Apply Customer Tag

Go to project view

Select RTM fnames

Apply Customer and Content Tags

Lab 5 - Match Enhancements

Pre-req for image

Copy 3 .ysp files from C:\Informatica\9.1.0\services\DQContent\INFA_Content\identity\

To a newly created folder called default at

C:\Informatica\9.1.0\services\DQContent\INFA_Content\identity\default

Key Gen and Match Analysis

Objective: Identify potential duplicates

Steps

Create New Mapping: m_match_customer

Add c:\DQ_Data\aml_demo_data source

Add Key Generator

Use String strategy on iso_ctry_code

Right-click the Key Generator and select Analyse Detail from the menu

Review the following information:

Estimated processing time

Groups above the recommended threshold

Edit the desired throughput value and observe how estimated processing time changes

Edit min and max group size values

Select groups above the threshold from the dropdown list and drilldown to the record level

Re-configure Key Generator to zip_or_postcode port

Run GroupKey Analysis again and observe the results

Add Match transform and configure as follows

Field Matching (Single Source)

Edit Distance on contact_name

Threshold 0.6

Select Runtime Analysis by MatchType from the right click menu

Review results

Select Output Analysis of Clusters from the right click menu

Review results (There is a bug in this Beta release so you may get funny results)

Repeat the above steps for a Match transform configured for Identity Matching
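To see why the choice of group key matters, and what an edit-distance threshold of 0.6 means in practice, the following Python sketch may help. It is not the Match transform's actual algorithm: the similarity score here is assumed to be 1 minus the Levenshtein distance divided by the longer string length, and the sample records and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_by_key(records, key_field):
    """Group records on the key port, as a key generator would."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_field]].append(rec)
    return groups


def candidate_pairs(groups, field, threshold=0.6):
    """Compare records only within a group; keep pairs at or above the threshold."""
    pairs = []
    for members in groups.values():
        for left, right in combinations(members, 2):
            score = similarity(left[field], right[field])
            if score >= threshold:
                pairs.append((left[field], right[field], round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample rows using port names from the lab.
    records = [
        {"contact_name": "John Smith", "zip_or_postcode": "10001"},
        {"contact_name": "Jon Smith", "zip_or_postcode": "10001"},
        {"contact_name": "Mary Jones", "zip_or_postcode": "10001"},
        {"contact_name": "John Smith", "zip_or_postcode": "94105"},
    ]
    groups = group_by_key(records, "zip_or_postcode")
    for key, members in groups.items():
        n = len(members)
        print(key, "records:", n, "pair comparisons:", n * (n - 1) // 2)
    print(candidate_pairs(groups, "contact_name"))
```

The number of comparisons grows with the square of the group size, which is why a coarse key such as iso_ctry_code tends to give a much longer estimated processing time than zip_or_postcode.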

Lab 6 - New Exception Transform

Objective: Identify and manually correct exception records

Think of this as a step further down a mapping in which you have already run data cleansing and address validation, as well as a data quality check on the phone field. You now need to decide which records pass and which need manual intervention.

Steps

Create New Mapping: m_exception_records

Add c:\DQ_data\cleansed_customer source

Add Decision

Assign a score of 60 to records with AddressStatus 'Incomplete Address' or 'Invalid Address Line', or PhoneStatus 'Incomplete Phone'

Assign a score of 90 to all remaining records

if AddressStatus = 'Incomplete Address' or AddressStatus = 'Invalid Address Line' or PhoneStatus = 'Incomplete Phone' then
    score := 60
else
    score := 90
endif

Add Exception transform and configure as follows

Bad Records Exception

Table BAD RECORDS in Staging DB

Connect data ports

Add AddressStatus and PhoneStatus ports to Labels input

Connect the score port to Inputs >> Control >> Score.

Records with a score between 40 and 90 to be reviewed manually

Send good records to standard output and bad records to Bad Records table

Map AddressStatus and PhoneStatus ports to respective issues in the priority tab

Run the mapping (Data Viewer)

Run the data viewer on the Exception transform

In Analyst, add the newly created table to the project view

Review available filters

Select Edit mode

Correct a number of records and select Save All Corrected Records

Open the audit trail and review the changes made

Hover over the new value to see the old value
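The scoring and routing logic in this lab can be sketched in a few lines of Python. This is an illustration only, not how the Decision or Exception transforms are implemented. The lab states the 40 to 90 review band but not how the boundaries are handled, so the sketch assumes the band includes 40 and excludes 90 (otherwise the records scored 90 would also go to review), and it assumes scores below 40 are rejected. The function names and output labels are hypothetical.

```python
LOWER_THRESHOLD = 40  # from the lab: manual-review band starts at 40
UPPER_THRESHOLD = 90  # from the lab: manual-review band ends at 90

FLAGGED_ADDRESS = ("Incomplete Address", "Invalid Address Line")


def score_record(record: dict) -> int:
    """Same rule as the Decision expression: 60 for any flagged status, else 90."""
    if (record["AddressStatus"] in FLAGGED_ADDRESS
            or record["PhoneStatus"] == "Incomplete Phone"):
        return 60
    return 90


def route(score: int) -> str:
    """Assumed boundaries: 90 and above is good, 40 to 89 goes to review."""
    if score >= UPPER_THRESHOLD:
        return "good"
    if score >= LOWER_THRESHOLD:
        return "bad_records"
    return "rejected"  # assumption: the lab does not say what happens below 40


if __name__ == "__main__":
    # Hypothetical sample rows; the status values are the ones named in the lab.
    rows = [
        {"AddressStatus": "Valid", "PhoneStatus": "Valid"},
        {"AddressStatus": "Incomplete Address", "PhoneStatus": "Valid"},
        {"AddressStatus": "Valid", "PhoneStatus": "Incomplete Phone"},
    ]
    for row in rows:
        score = score_record(row)
        print(row, score, route(score))
```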

Lab 7 - Data Quality for MS Excel

Install DQ for Excel

Objective: Perform base install of DQ for Excel

Steps

Extract lwp3.zip (from lab files provided) to Desktop

Close Excel

Run setup.exe

Note: this will connect to the internet to download the base Excel DLLs required to run the Add-In

Once complete, open Excel

Check Informatica Ribbon is available

Use DQ for Excel

Objective: Add new service to DQ for Excel and use with sample data

Steps

Export WSDL from service created in DQ/WS lab

Open Excel

Select Informatica Ribbon and Add Service

Point to WSDL file extracted from first step

Open customer.xlsx (see Desktop on 9.1 TTT VM, folder DQ_data_enablement91)

Use imported service to parse Name column into First and Last names

Lab 8 - Profiling Labs

Profiling

1. Add 3 Data Objects (Location: c:\DQ_DATA) Tool Matters (make sure you click the because there are embedded commas in the data)

a. Customer

b. Orders

c. Product

2. Profile Customer Columns only

3. Delete profile

4. Profile All the tables at once

a. 2 ways

i. Select all the data objects and profile

ii. Profile one, then add the other two

1. Add a Prefix of DW_ to the objects

5. Profile Customer Column, Primary Key and Dependency

a. Take all the defaults (but keep hitting next not finish to see them)

Remember to select the columns for PK and Functional Dependency

b. View PK Results

c. Select Cust_number verify

i. What happened to the display

d. View Violations

e. Select cust_Number 15952672 and Drill down

i. What is the difference between the records?

f. View Functional Dependencies

i. Why are there blank determinants?

g. Select a column and verify

i. What happened in the Display?

6. Profile Orders, all three (Column, Primary Key and Dependency), and take defaults

a. View column profile

i. What is the Key to this table?

b. View Primary Key inference

i. What is the key?

c. View the Functional Dependencies

i. Can you identify any potential Sub-tables

7. Delete the orders profile

8. Re-profile orders. This time, override and change the default options

Hint: Be careful in changes to these options otherwise you will be here till Saturday

a. Primary Key Minimum Percent = 100

b. Dependency - Minimum Percent = 100

i. View results

ii. Sort on the determinant column

iii. Can you identify a key?

iv. Can you identify a potential sub-table?

9. Go back and modify the profile definition (PD) to change the minimum to 75

i. What does this show you?

10. Delete the profile

11. Profile orders dependency only

a. Exclude all columns from determinants except sales_id

b. Verify all fields that show 100%

12. Go back and modify the profile definition to profile all three tasks

a. Did it work?

13. Profile orders all three tasks

a. Verify the primary key

b. Look at dependencies

i. Why are form, ingredient_list, on_hand and segment not determined by the primary key

ii. How can I fix segment and form to show they are determined by the primary key?
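When reading the Primary Key and Functional Dependency results, it can help to see what a "minimum percent" might measure. The Python sketch below uses assumed definitions, not the profiler's documented ones: key uniqueness is taken as the share of rows whose key value occurs exactly once, and dependency conformance as the share of rows that agree with the most common dependent value for their determinant. The column names in the sample are hypothetical.

```python
from collections import Counter, defaultdict


def key_uniqueness_percent(rows, columns):
    """Percent of rows whose (possibly composite) key value occurs exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    unique_rows = sum(1 for k in keys if counts[k] == 1)
    return 100.0 * unique_rows / len(keys)


def dependency_percent(rows, determinant, dependent):
    """Percent of rows consistent with determinant -> dependent."""
    if not rows:
        return 0.0
    by_determinant = defaultdict(Counter)
    for row in rows:
        by_determinant[row[determinant]][row[dependent]] += 1
    # For each determinant value, rows holding the most common dependent conform.
    conforming = sum(max(c.values()) for c in by_determinant.values())
    return 100.0 * conforming / len(rows)


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the lab files.
    rows = [
        {"id": 1, "zip": "10001", "state": "NY"},
        {"id": 2, "zip": "10001", "state": "NY"},
        {"id": 3, "zip": "94105", "state": "CA"},
        {"id": 3, "zip": "94105", "state": "NV"},
    ]
    print("id as key:", key_uniqueness_percent(rows, ["id"]))
    print("zip -> state:", dependency_percent(rows, "zip", "state"))
    print("id -> zip:", dependency_percent(rows, "id", "zip"))
```

Under these definitions, a minimum of 100 only reports keys and dependencies with no violations at all, while lowering it to 75 also surfaces near-keys and near-dependencies whose violations can then be drilled into.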

Filters

1. Delete Customer profile

2. Profile customer adding filters

a. Address3 = NY

b. Address3 = CA

c. Address3 = NY and address2 = NEW YORK

3. Run each profile, view results, modify Profile definition and run the next one.

Drilldowns

1. Open the Customer Profile in the Analyst tool

2. Remove any active filters and re-run the profile

3. Go to address3 and drill down on NY

4. Edit the drill down filter and add address3 = NEW YORK (yes case matters)

5. Add another field Zip_or_postcode < 10020

6. Add to the filter, iso_ctry_code is not null or != USA or != U.S.A.

Modeling

1. Create a Profile model called Whatever with orders, customer and product

a. Select the customer object, right click and data object profile

b. Profile all 3 steps setting options you think are appropriate

c. View results

d. In Primary Key inference, right click and add cust_number to model

2. Go back to the default view and create and run a data object profile for orders

a. View results

b. Add item+order to the model

3. Go back to the default view and create and run a data object profile for product

a. Make sure you verify all your keys

b. Add product_id to the model

4. Select customer and orders and profile foreign keys

a. Verify and add relationship to the model

b. Re-select the relationship and view the Venn diagram

c. Double click on the non-overlapping orders.

i. See orders without a valid customer ID

ii. Double click on customer and see all the deadbeat customers.

5. Go back to the default view and do the same for orders and product changing the inference options (Trim values and case sensitivity)

a. Verify the relationship (if it makes sense)

b. Re-select the relationship and view the Venn diagram.

i. Find the products that don't have open orders.

ii. Find the most ordered products.

6. FINALLY (well almost), select customer and do a join profile

a. Add orders

b. Add a join on cust_number and GCLOC
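The Venn diagram in the foreign key results is essentially a comparison of two sets of key values. A minimal Python sketch of the same idea follows. The function names and sample values are hypothetical, and the trim and case options are only an assumed approximation of the inference options mentioned above.

```python
from collections import Counter


def normalise(value, trim=True, case_sensitive=False):
    """Approximate the 'trim values' and 'case sensitivity' inference options."""
    text = str(value)
    if trim:
        text = text.strip()
    if not case_sensitive:
        text = text.upper()
    return text


def venn_regions(parent_keys, child_keys, **options):
    """Split key values into parent-only, shared and child-only regions."""
    parents = {normalise(k, **options) for k in parent_keys}
    children = {normalise(k, **options) for k in child_keys}
    return {
        "parent_only": parents - children,  # e.g. customers with no orders
        "both": parents & children,
        "child_only": children - parents,   # e.g. orders with no valid customer
    }


def most_referenced(child_keys, top=3, **options):
    """Most frequent key values on the child side, e.g. most ordered products."""
    return Counter(normalise(k, **options) for k in child_keys).most_common(top)


if __name__ == "__main__":
    # Hypothetical key values.
    customer_keys = ["1001", "1002", "1003"]
    order_keys = ["1001", " 1001", "1004"]
    print(venn_regions(customer_keys, order_keys))
    print(most_referenced(order_keys))
```

With trimming switched off, the value " 1001" would no longer match "1001", which is the kind of difference the inference options are meant to expose.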

Generating a Mapping from a Profile

1. Open the Customer Profile

2. Add an OOTB rule to validate cust_start_date

a. You may want to copy it and change the date format if you want it to actually work correctly.

3. Add an OOTB rule to validate Last_Order_Date

a. See above

4. Create a rule to validate iso_ctry_code (US is only valid value)

a. IIF(iso_ctry_code ='US','Valid','Invalid')

5. Add an OOTB rule to remove punctuation from contact_name

6. Add more as appropriate or stop to get out early.

7. Run the profile (anything interesting?)

8. Right click on profile and generate a mapping

9. View the mapping and see the results and behavior.
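For comparison, the kind of logic these rules apply can be sketched in Python. This is only an illustration of the rule behaviour, not what the generated mapping contains. The date format %m/%d/%Y is an assumption (the lab does not state the format of cust_start_date), and the function names are hypothetical.

```python
import string
from datetime import datetime


def validate_country(iso_ctry_code):
    """Same idea as IIF(iso_ctry_code = 'US', 'Valid', 'Invalid')."""
    return "Valid" if iso_ctry_code == "US" else "Invalid"


def validate_date(value, fmt="%m/%d/%Y"):
    """'Valid' when the value parses as a real calendar date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return "Valid"
    except (ValueError, TypeError):
        return "Invalid"


def remove_punctuation(text):
    """Drop ASCII punctuation characters, keeping letters, digits and spaces."""
    return text.translate(str.maketrans("", "", string.punctuation))


if __name__ == "__main__":
    # Hypothetical sample values.
    print(validate_country("US"), validate_country("USA"))
    print(validate_date("12/31/2010"), validate_date("31/12/2010"))
    print(remove_punctuation("O'Brien, J."))
```

If the dates in the data use a different layout, the format string has to change accordingly, which is the same reason the lab suggests copying the rule and changing its date format.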
