IBM Software Group
©IBM Corporation
IBM Information Server
Cleanse - QualityStage
IBM Information Server: Delivering information you can trust
Understand: Discover, model, and govern information structure and content
Cleanse: Standardize, merge, and correct information
Transform: Combine and restructure information for new uses
Deliver: Synchronize, virtualize, and move information for in-line delivery
Platform Services: Parallel Processing, Connectivity, Metadata, Deployment, Administration
Support for Service-Oriented Architectures
The IBM Solution: IBM Information Server
Delivering information you can trust
Understand, Cleanse, Transform, Deliver
Parallel Processing
Rich Connectivity to Applications, Data, and Content
Unified Deployment
Unified Metadata Management
Cleanse: WebSphere QualityStage
Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views
Need for Data Quality
Critical Problems
• Need to create & maintain 360 degree views of customers, suppliers, products, locations, events
• Need to leverage data: make reliable decisions, comply with regulations, meet service agreements
Why?
• No common standards across organization
• Unexpected values stored in fields
• Required information buried in free-form fields
• Fields evolve: used for multiple purposes
• No reliable keys for consolidated views
• Operational data degrades 2% per month
Alternative Approaches
• Denial: problem misunderstood and ignored until too late; load and explode
• Hand-coding: clerical exception processing; very time consuming and resource intensive
• Simplistic cleansing apps: evolved from direct marketing & list hygiene, lack flexibility
Kent Fried Chick
Kentucky Fried
Kentucky Fried Chicken
KFC
Molly Talber DBA KFC
Mrs. M. Talber
John & Molly Talber
Talber, KFC, ATIMA
Data Sources Data Values
227G CB&NATURAL STICKMOZZ WRAPPER
227G CB&NAT STICK P QUE/MOZZ WRAPP.
Why Should I Care About Cleansing Information?
• Lack of information standards: Different formats & structures across different systems
• Data surprises in individual fields: Data misplaced in the database
• Information buried in free-form fields
• Data myopia: Lack of consistent identifiers inhibits a single view
• The redundancy nightmare: Duplicate records with a lack of standards
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
Name                     Tax ID        Telephone
J Smith DBA Lime Cons.   228-02-1975   6173380300
Williams & Co. C/O Bill  025-37-1888   415-392-2000
1st Natl Provident       34-2671434    3380321
HP 15 State St.          508-466-1200  Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
Importance of Data Quality
Low data quality impacts an organization in several ways
• Poor data quality leads to misguided marketing promotions
• Cross-sell opportunities may be missed because the same customer appears several times in slightly different ways
• Valued customers may not be recognized during support calls or other important touchpoints
• Data mining is difficult because related items are not detected as related
What is good data quality?
• Two percent of “bad” data doesn’t sound that bad?
• Two percent of 10M rows means that you have 200K errors
• 200K errors add up to a big problem for analytics/operations/anything!
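The arithmetic behind the 200K figure can be checked with a one-line calculation. This is only an illustrative sketch; the function name `expected_errors` is made up for this example.

```python
def expected_errors(row_count, error_percent):
    """Rows expected to be bad at a given whole-number error percentage."""
    # Integer arithmetic avoids any floating-point rounding surprises.
    return row_count * error_percent // 100


# Two percent of 10M rows, as on the slide.
print(expected_errors(10_000_000, 2))  # 200000
```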
Enterprise initiatives…
• Supply chain collaboration & item synchronization
• Inventory consolidation
• Single view of a customer or supplier
• ERP Implementations
• ERP instance consolidation
• IT System renovation
• Consolidation resulting from M&A activity
• Enterprise Data Warehouse
• Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)
…need high quality data…
…to satisfy critical business requirements.
• Compliance
• Business to Business Standards
• Risk Management
• Reduce Costs & Increase Productivity
• Increase Revenue / CRM Payoff
• Business Intelligence Payoff
IBM WebSphere QualityStage
Shared design environment with DataStage increases functionality and reduces development time
Visual match rule interface simplifies match tuning
Service orientation provides ‘continuous’ quality & delivers confidence in your data
Parallel architecture shortens execution time
WebSphere QualityStage Process
How will you get an accurate, consolidated view of your business?
Customers
Transactions
Vendors / Suppliers
Products / Materials
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Target: Database with Consolidated Views
Why Investigate
Discover trends and potential anomalies in the data
100% visibility of single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules and common terminology
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data within context
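One simple way to surface invalid and default values is to reduce each value to a character pattern and count how often each pattern occurs. The following is a minimal sketch of that idea, not QualityStage's own mechanism. The mask alphabet (`9` for digits, `A` for letters) and the function names are assumptions of this example. The sample values come from the Telephone column shown earlier in this deck.

```python
from collections import Counter


def pattern(value):
    """Replace digits with 9 and letters with A; keep punctuation as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile(values):
    """Count how many values share each character pattern."""
    return Counter(pattern(v) for v in values)


phones = ["6173380300", "415-392-2000", "3380321", "Orlando"]
for mask, count in profile(phones).most_common():
    print(mask, count)
```

Rare patterns, such as the all-letter mask produced by "Orlando" in a telephone field, are the candidates worth investigating.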
Investigation - Free Form
“The instructions for handling the data are inherent within the data itself.”
Parsing: Separating multi-valued fields into individual pieces
123 | St. | Virginia | St.
Lexical analysis: Determining business significance of individual pieces
number | street type | state | street type
123 | St. | Virginia | St.
Context Sensitive: Identifying various data structures and content
House Number | Street Name | Street Type
123 | St. Virginia | St.
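The three steps on this slide (parsing, lexical analysis, context-sensitive interpretation) can be sketched in a few lines. This is a toy illustration, not QualityStage's rule language. The lookup tables and the single context rule (leading number is the house number, trailing street-type token is the street type, everything in between is the street name) are assumptions chosen to reproduce the "123 St. Virginia St." example.

```python
# Hypothetical lookup tables; a real rule set would be far larger.
STREET_TYPES = {"ST.", "ST", "AVE", "RD"}
STATES = {"VIRGINIA", "GEORGIA", "OHIO"}


def parse(text):
    """Parsing: split a free-form field into individual tokens."""
    return text.split()


def classify(token):
    """Lexical analysis: class of a token judged on its own content."""
    upper = token.upper()
    if upper.isdigit():
        return "number"
    if upper in STREET_TYPES:
        return "street type"
    if upper in STATES:
        return "state"
    return "word"


def interpret(tokens):
    """Context-sensitive step: use token position to decide each role."""
    classes = [classify(t) for t in tokens]
    rest = list(tokens)
    result = {"house_number": "", "street_name": "", "street_type": ""}
    if classes and classes[0] == "number":
        result["house_number"] = rest.pop(0)
        classes = classes[1:]
    if classes and classes[-1] == "street type":
        result["street_type"] = rest.pop()
    result["street_name"] = " ".join(rest)
    return result


tokens = parse("123 St. Virginia St.")
print([classify(t) for t in tokens])
print(interpret(tokens))
```

Taken token by token, "St." and "Virginia" look like a street type and a state. Only their position shows that together they form the street name.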
Rule Sets
• Pre-defined rules for parsing and standardizing: Name, Address, Area (City, State and Zip)
• Multi-national address processing
• Validate structure: Tax ID, US Phone, Date, Email
• Append ISO country codes
• Pre-process or filter name, address and area
• Rule sets are stored in the common repository
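Structure validation of the kind listed above can be approximated with regular expressions. The patterns below are simplified assumptions for illustration and are not the rules shipped with the product. The Tax ID pattern accepts the two shapes seen in this deck's sample data (`228-02-1975` and `34-2671434`), and the date format is assumed to be month/day/four-digit year.

```python
import re
from datetime import datetime

# Simplified, hypothetical patterns; real validation rules are stricter.
TAX_ID = re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}-\d{7}")
US_PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid(pattern, value):
    """True when the whole value (ignoring outer spaces) fits the pattern."""
    return pattern.fullmatch(value.strip()) is not None


def is_valid_date(value, fmt="%m/%d/%Y"):
    """True when the value parses as a real calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


print(is_valid(TAX_ID, "228-02-1975"))   # True
print(is_valid(TAX_ID, "508-466-1200"))  # False: a phone number in the Tax ID field
print(is_valid(US_PHONE, "3380321"))     # False: area code missing
```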
Standardization - Example
Input File:
Address Line 1 Address Line 2
639 N MILLS AVENUE ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130
3142 WEST CENTRAL AV TOLEDO OH 43606
843 HEARD AVE AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS GA 30809
Result File:
House #  Dir  Str. Name  Type  Unit  No.  NYSIIS   City     SOUNDEX  State  Zip    ACCT#
639      N    MILLS      AVE              MAL      ORLANDO  O645     FL     32803
306      W    MAIN       ST               MAN      CUMMING  C552     GA     30130
3142     W    CENTRAL    AVE              CANTRAL  TOLEDO   T430     OH     43606
843           HEARD      AVE              HAD      AUGUSTA  A223     GA     30904
1139          GREENE     ST               GRAN     AUGUSTA  A223     GA     30901  1234
4275          OWENS      RD    STE   536  ON       EVANS    E152     GA     30809
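Two pieces of the Result File above can be reproduced with a short sketch: the abbreviation standardization (AVENUE to AVE, ROAD to RD, and so on) and the SOUNDEX code computed on the city. The abbreviation table is a hypothetical sample built from this one example, and the Soundex function follows the standard American Soundex rules rather than anything confirmed about the product's implementation. NYSIIS is left out because it has several variants.

```python
# Hypothetical abbreviation table derived from the example rows only.
ABBREVIATIONS = {
    "AVENUE": "AVE", "AV": "AVE", "STR": "ST", "ROAD": "RD",
    "SUITE": "STE", "WEST": "W", "FLA": "FL", "GEORGIA": "GA",
}

# Standard Soundex letter groups.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def standardize(line):
    """Uppercase, drop comma/hyphen separators, and apply abbreviations."""
    tokens = line.upper().replace(",", " ").replace("-", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def soundex(word):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


print(standardize("4275 OWENS ROAD SUITE 536"))  # 4275 OWENS RD STE 536
for city in ("ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"):
    print(city, soundex(city))
```

The printed codes agree with the SOUNDEX column of the Result File (O645, C552, T430, A223, E152). A blind token lookup like this would also shorten a street that is actually named "West", which is why real rule sets use context.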
Why Match
Identify duplicate entities within one or more files
Perform householding
Create consolidated view of customer
Establish cross-reference linkage
Enrich existing data with new attributes from external sources
Two Methods to Decide a Match
Are these two records a match?
WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
B B A A B D B A = BBAABDBA
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
+5 +2 +20 +3 +4 -1 +7 +9 = +49
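The two approaches can be contrasted in a short sketch. The letter grades and field weights are the ones shown on the slide. Everything else is an assumption for illustration: the decision-table entry, the cutoff values, and the weight formulas, which follow the commonly used Fellegi-Sunter formulation (log base 2 of m/u for agreement) and are not confirmed by the deck as the product's method.

```python
import math

# Deterministic: concatenate per-field grades and look the pattern up.
DECISION_TABLE = {"BBAABDBA": "Suspect"}  # hypothetical table entry


def deterministic_decision(grades):
    """Unknown grade patterns fall through to "Fail" (an assumption)."""
    return DECISION_TABLE.get("".join(grades), "Fail")


# Probabilistic: each field contributes a weight; the weights are summed.
def agreement_weight(m, u):
    """m: P(field agrees | true match); u: P(field agrees | non-match)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    return math.log2((1 - m) / (1 - u))


def probabilistic_decision(weights, match_cutoff, clerical_cutoff):
    total = sum(weights)
    if total >= match_cutoff:
        return total, "Match"
    if total >= clerical_cutoff:
        return total, "Suspect"
    return total, "Fail"


slide_weights = [5, 2, 20, 3, 4, -1, 7, 9]
print(deterministic_decision("BBAABDBA"))
print(probabilistic_decision(slide_weights, match_cutoff=40, clerical_cutoff=20))
```

With the slide's weights the total is +49, which clears the assumed match cutoff of 40. A field that agrees rarely by chance (small u) earns a large positive weight, which is the sense in which a weight reflects information content.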
Why Survive
Provide consolidated view of data
Provide consolidated view containing the “best-of-breed” data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business and mapping rules
Create cross-reference keys
Survivorship - Example
Survivorship Input (Match Output)
Group  Legacy  First   Middle  Last     No.   Dir.  Str. Name   Type  Unit  No.
1      D150    Bob             Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert          Dickson  1500        ROSS CLARK  CIR
23     D689    Ernest  A       Obrian   5901  SW    74TH        ST    STE   202
23     A436    Ernie   Alex    O’Brian  5901  SW    74TH        ST
23     D352    Ernie           Obrian   5901        74          ST    #     202
Consolidated Output
Group  First   Middle  Last     No.   Dir.  Str. Name   Type  Unit  No.
1      Robert          Dickson  1500  SE    ROSS CLARK  CIR
23     Ernie   Alex    O’Brian  5901  SW    74TH        ST    STE   202
Group  Legacy
1      D150
1      A1367
23     D689
23     A436
23     D352
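Survivorship can be sketched as a per-field rule applied across all records in a match group. The rules below are hypothetical. They were chosen because they happen to reproduce the consolidated output for group 23 in this example: the first name survives by frequency, and every other field survives by longest non-empty value. The deck does not say which rules were actually used.

```python
from collections import Counter

FIELDS = ("first", "middle", "last", "no", "dir", "street", "type", "unit", "unit_no")


def longest(values):
    """Keep the longest non-empty value."""
    candidates = [v for v in values if v]
    return max(candidates, key=len) if candidates else ""


def most_frequent(values):
    """Keep the most common non-empty value; break ties by length."""
    candidates = [v for v in values if v]
    if not candidates:
        return ""
    counts = Counter(candidates)
    return max(counts, key=lambda v: (counts[v], len(v)))


# Hypothetical rule assignment; fields not listed use `longest`.
RULES = {"first": most_frequent}


def survive(group):
    """Build one consolidated record from a list of matched records."""
    return {f: RULES.get(f, longest)([rec[f] for rec in group]) for f in FIELDS}


rows_23 = [
    ("Ernest", "A", "Obrian", "5901", "SW", "74TH", "ST", "STE", "202"),
    ("Ernie", "Alex", "O'Brian", "5901", "SW", "74TH", "ST", "", ""),
    ("Ernie", "", "Obrian", "5901", "", "74", "ST", "#", "202"),
]
group_23 = [dict(zip(FIELDS, row)) for row in rows_23]
print(survive(group_23))
```

The legacy keys are deliberately kept out of the consolidated record. They belong in the separate group-to-legacy cross-reference.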
How Does WebSphere QualityStage Integrate
Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.
Data Extraction and Load Routines
QualityStage
1. Investigation
2. Standardization
3. Integration
4. Survivorship
Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.
WebSphere DataStage and WebSphere QualityStage: Fully Integrated!
QualityStage: Data Quality Extensions
• IBM WebSphere QualityStage GeoLocator
• IBM WebSphere QualityStage Postal Verification Products
  WAVES (Worldwide): IBM WebSphere Worldwide Address Verification Solution
• IBM WebSphere QualityStage Postal Certification Products
  CASS (United States)
  SERP (Canada)
  DPID (Australia)
• IBM Information Server Data Quality Module for SAP
• IBM WebSphere QualityStage for Siebel
Key Strengths for IBM QualityStage
Intuitive, “Design as you think” User Interface
Simple rule design & fine tuning
Seamless Data Flow integration
Intuitive rule design & fine tuning
Defining the technology standard with SOA
Industry leading probabilistic matching engine
Thank You