Upload
benard
View
29
Download
0
Embed Size (px)
DESCRIPTION
Informationsintegration Information Quality. 26.1.2006 Felix Naumann. Overview. Motivation: IQ for integrated IS Definition of IQ Optimizing IQ IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration. IIS. - PowerPoint PPT Presentation
Citation preview
Informationsintegration
Information Quality
26.1.2006
Felix Naumann
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 2
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 3
Database Management Systems vs. Integrated Information Systems
DBMS
IIS
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 4
DBMS Quality vs. IIS Quality
Complete (assumed)
Accurate Trusted Fast Free
Incomplete Inaccurate Untrusted Slow Possible cost
High expectations High quality
Low expectations Low quality
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 5
Datenqualität vs. Datenfehler
Qualität kann nicht einzig durch Data Cleansing erhöht werden. Ansehen, Objektivität, …
Accuracy ≠ Quality Duplicates ≠ Quality
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 6
Datenqualität in IIS
Integrierte Informationssysteme besonders anfällig für Qualitätsprobleme Probleme akkumulieren
Qualität der Ursprungsdaten (Eingabe, Fremdfirmen,...)
Qualität der Quellsysteme (Konsistenz, Constraints, Fehler, ...)
Qualität der Integrationsprozesse Parsen, Transformieren Mappings
Probleme treten erst bei integrierter Sicht zu Tage
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 7
Example: Customer Relationship Management (CRM)
Probleme im CRM eines Multi-Channel Vertriebs Kunden doppelt geführt Kunden falsch bewertet Falsche Adressen Haushalte / Konzernstrukturen nicht erkannt
Folgen „False positives“: Verärgerte Kunden durch mehrere /
unpassende Mailings „False negatives“: Verpasste Gelegenheiten durch
fehlende / falsche Zuordnung (Cross-Selling) Sinnlose Portokosten bei falschen Adressen
Quelle: Prof. Ulf Leser (VL Data Warehouses)
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 8
Cost of Dirty Data
A.T. Kearny: 25%-40% der operativen Kosten entstehen durch schlechte Datenqualität.
Data Warehouse Institute: Industrie und Verwaltung in den USA verlieren jährlich 600 Milliarden USD.
SAS Studie: Nur 18% der Deutschen Betriebe vertrauen ihren Daten.
AT&T (70er): 20-30% aller Anschlüsse unbenutzt wegen schlechter Daten.
80% aller Krankenhaus Datensätze enthalten Fehler. Hmmm...
...
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 9
Optimize IQ!
Fixed quality complete &
correct Optimize cost
time & throughput
Fixed cost price, patience, …
Optimize IQ IQ criteria
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 10
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 11
Information Quality (IQ) Was ist Informationsqualität ?
„Fitness for use“ “User satisfaction” Anwendungsabhängig
Folgen geringer Datenqualität Falsche Prognosen Verpasstes Geschäft
Qualität ist besonders bei integrierten Informationen interessant Oft keine Kontrolle über Informationsquellen (Autonomie!) Oft zweifelhafte Qualität
Internet macht Publikation leicht Vielzahl verfügbarer Quellen
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 12
IIS Quality Criteria
IQ :=
“Even though quality cannot be defined, you know what it is.”
Robert Pirsig
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 13
Information Quality (IQ)
IQ := {Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ... }
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 14
IQ Classification of [WS96]
Intrinsic IQ Believability, Accuracy, Objectivity, Reputation
Contextual IQ Value-added, Relevancy, Timeliness, Completeness,
Amount
Representational IQ Interpretability, Understandability, Repr. Consistency, Repr.
conciseness
Accessibility IQ Accessibility, Security
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 15
Content-based IQ Criteria …concern the actual data. Accuracy
is the extent to which data is correct, reliable, and certified free of error. [WS96] Completeness
is the extent to which data is not missing and is of sufficient breadth, depth, and scope for the task at hand. [WS96]
Customer support is the amount and usefulness of human help via email or telephone.
Documentation is the amount and usefulness of documents with metadata.
Interpretability is the extent to which data is in appropriate languages, symbols, and units, and the definitions are
clear. [WS96] Relevancy (or relevance)
is the extent to which data is applicable and helpful for the task at hand. [WS96] Reliability
is the degree to which the user can trust the information Value-Added
is the extent to which data is beneficial and provides advantages from its use. [WS96]
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 16
Technical IQ Criteria …concern software and hardware. Accessibility (or availability)
of a DBMS is the probability that a feasible query is correctly answered in a given time range.
Is the extent to which data are available or easily and quickly receivable [WS96]. Latency
is the amount of time in seconds from issuing the query until the first data item reaches the user
Price (cost effectiveness) is the amount of money a user has to pay for a query. is the extent to which the cost of collecting appropriate data is reasonable [WS96].
Response time measures the delay in seconds between submission of a query by the user and
reception of the complete response from the IS. Security
is the extent to which access to data is restricted appropriately to maintain its security [WS96].
Timeliness is the extent to which the age of the data is appropriate for the task at hand [WS96].
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 17
Intellectual IQ Criteria
…concern subjective aspects. Believability
is the extent to which data is regarded as true, real, and credible [WS96].
Objectivity is the extent to which data is unbiased,
unprejudiced, and impartial [WS96]. Reputation
is the extent to which data is trusted or highly regarded in terms of its source or content [WS96].
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 18
Instantiation-related IQ Criteria …concern the presentation of retrieved data. Amount of data
is the extent to which the quantity or volume of available data is appropriate [WS96].
Representational conciseness is the extent to which data is compactly represented without being
overwhelming [WS96]. Representational consistency
is the extent to which data is always represented in the same format and are compatible with previous data [WS96].
Understandability (ease of understanding) is the extent to which data are clear without ambiguity and easily
comprehended [WS96]. Verifiability (traceability)
Is the extent to which data are well documented, verifiable, and easily attributed to a source [WS96].
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 19
IQ Criteria (classical) Accuracy
Definition: Usually: Percentage of incorrect tuples For integration: Percentage of incorrect data values
Assessment: Domain and Constraint Testing Lookup tables Scientific measurements Data-input experience
Improvement: Often: Deletion Better: “Data Scrubbing”
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 20
IQ Criteria (classical) Response Time
Definition: Usually: Time until complete query result is received For integration: Latency
Assessment: “Cost Calibration” Continuous assessment
Improvement: Source selection Classical optimization Federated Optimization
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 21
IQ Criteria (new)
Completeness Definition:
Coverage: Number of real world objects represented Density: Number of attributes covered For IIS: NULL-values
Assessment: Sampling Existing Metadata
Improvement: Source selection “Best k” vs. “k best”
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 22
IQ Criteria (new)
Reputation / Trust Definition:
Reputation: Memory and summary of behavior from past transactions
Trust: Expectation about future behavior Assessment:
Individual experience Corporate guidance Trust-networks
Improvement: ???
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 23
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 24
Optimize IQ!
Fixed quality complete &
correct Optimize cost
time & throughput
Fixed cost price, patience, …
Optimize IQ IQ criteria
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 25
A New Optimization Paradigm – Many Changes
DBMS Cost criteria Cost model Optimization
algorithm
Integrates IS Quality criteria Quality model Optimization algorithm + Information
integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 26
DB Cost Criteria
Response time Execution time Latency Throughput Cardinality …
Assessed through system parameters and statistics.
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 27
IIS Quality Criteria
{Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ... }
IQ :=
Assessed in 3 classes…
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 28
IQ-Assessment
Subjekt
Anfrage
Prozess Objekt
Vollständigkeit Zeitnähe ...
Verfügbarkeit Antwortzeit ...
Relevanz Glaubwürdigkeit ...
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 29
IQ-Assessment
Information source: S1 S2 S3 Understandability 5 7 7 Reputation 5 5 7 Reliability 2 6 4 Age 30 30 2 Availability 99 99 60 Completeness 1 1 0.5 Response Time 0.2 0.2 0.2 Accuracy 99.9 99.9 99.8 Relevancy 60 80 90
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 30
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 31
DB Cost Models
Operators + (add) max x (multiply)
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 32
A Quality Model for Integrated IS
2 Problems Many Dimensions Multidimensional
S1(95,0,0.7,1,99.95,60,48.2,0)
S2(99,0,1,0.2,99.9,80,52.8,0)
S3(95,0,0.7,1,99.95,60,38,3)(?, ?, ?, ?, ?, ?, ?, ?)
(?, ?, ?, ?, ?, ?, ?, ?)⊓
⊔
aggregated IQ-vector
IQ-vector
Mergingoperators
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 33
⊔
IQ Merge FunctionsAvailability: A BPrice: A + BResponse Time: max[A, B]Coverage: Sylvester
Merge IQ in many Dimensions
(94.05,0,1,1,99.85,48,54.86,0)
(89.35,0,1,1,99.8,28.8,76.06,3)
S1(95,0,0.7,1,99.95,60,48.2,0)
S2(99,0,1,0.2,99.9,80,52.8,0)
S3
merge
merge
(95,0,0.7,1,99.95,60,38,3)
⊓
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 34
Multidimensional IQ
IQ-criteria have Different units Different ranges Different importance
So... convert scale weight
(89.35,0,1,1,99.8,28.8,76.06,3) > (82.35,0,2,1.5,95,32,71.77,2) ?
MADM methods:SAW, TOPSIS, ELECTRE, AHP, DEA
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 35
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 36
DB-type Optimization
Goal Minimize response time Maximize throughput
Restrictions Complete Correct (not just accurate: filter conditions…)
Find best plan!
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 37
DB-type Optimization
(name, company)
emp comp
⋈(compID=ID)
(salary > 1000)
(name, company)
compemp
⋈(compID=ID)
(salary > 1000)
SELECT name, company FROM emp, comp WHERE emp.compID = comp.ID AND emp.salary > 1000
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 38
DB-type Optimization
(name, company)
compemp_1
⋈(compID=ID)
(salary > 1000)
emp_n...
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 39
DB-type Optimization
(name, company)
comp
emp_1
⋈(compID=ID)
(salary > 1000)
emp_n...
MERGE
(name, company)
comp
emp_1
⋈(compID=ID)
(salary > 1000)
emp_n...
MERGE
(salary > 1000)
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 40
IIS-type Optimization
Change is efficient But: Result can be
incomplete.
Preferences?
(name, company)
comp
emp_1
⋈(compID=ID)
(salary > 1000)
emp_n...
MERGE
(salary > 1000)
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 41
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 42
IIS-type Optimization
Goal Maximize information quality (Maximize completeness)
Restrictions Price Bandwidth Time (user patience)
Find K best sources – Find best K sources
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 43
IIS-type Optimization
K best sources Simple IQ model, but Sources may not complement each other
at tuple level (replication)at attribute level
Best K sources Finds optimal query result Uses IQ merging
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 44
Naive (3-phase) approach [NLF99]
Input
Query,views,IQ scores
Phase 1
Source selection
Phase 3
Plan selection
Output
Quality-ranked plans
Phase 2Query planning
Good: executes only best plans Bad: still needs to compute all plans
BA
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 45
Integrated (single-phase) approach [LN00]
Input
Query, views,and IQ scores
Output
Quality-ranked plans
HiQ B&B: one phase, quality-based branch & bound algorithm
Good: executes only best plans Good: computes only a fraction of all plans
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 46
Overview
Motivation: IQ for integrated IS Definition of IQ Optimizing IQ
IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 47
Conflict Resolution
amazon.comamazon.com
bn.combn.com
IDID max lengthmax length MINMIN CONCATCONCAT
$5.99Moby DickHerman Melville0766607194
$3.98H. Melville0766607194
These are IQ considerations!
These are IQ considerations!
These are IQ considerations!
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 48
Conflict Resolution
null = unknown
Internal Conflict-Resolution Function
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 49
Conflict Resolution
Numerical:SUM, AVG, MAX, MIN, …
Non-numerical:MAXLENGTH, CONCAT, AnnCONCAT,…
Special:RANDOM, COUNT, CHOOSE, FAVOR, MaxIQ,…
Domain-specific…
Human speci-fication
of IQ
Automated speci-fication
of IQ
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 50
Conflict Resolution
amazon.comamazon.com
bn.combn.com
IDID max lengthmax length MINMIN CONCATCONCAT
$5.99Moby DickHerman Melville0766607194
$3.98H. Melville0766607194
slow, up-date, complete, …
fast, outdated, incomplete, …
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 51
MaxIQ Resolution Function
Per attribute Choose value of higher quality source.
Per tuple Choose tuple of higher quality source.
Per source Choose K best sources Choose best K sources
26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 52
Literatur [RD00] Data Cleaning: Problems and Current Approaches, Rahm
& Do, IEEE Bulletin 23(4), 2000. [MF03] Problems, Methods, and Challenges in Comprehensive
Data Cleansing. Heiko Müller, Johann-Christoph Freytag, Technical Report HUB-IB-164, Humboldt University Berlin, 2003
[WS96] Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. Journal on Management of Information Systems, 12(4):5-34, 1996.
[NLF99] Felix Naumann, Ulf Leser, and Johann-Christoph Freytag: Quality-driven Integration of Heterogenous Information Systems, VLDB 1999
[LN00] Ulf Leser and Felix Naumann: Query Planning with Information Quality Bounds, FQAS 2000.