52
Informationsintegrati on Information Quality 26.1.2006 Felix Naumann

Informationsintegration Information Quality

  • Upload
    benard

  • View
    29

  • Download
    0

Embed Size (px)

DESCRIPTION

Informationsintegration Information Quality. 26.1.2006 Felix Naumann. Overview. Motivation: IQ for integrated IS Definition of IQ Optimizing IQ IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration. IIS. - PowerPoint PPT Presentation

Citation preview

Page 1: Informationsintegration Information Quality

Informationsintegration

Information Quality

26.1.2006

Felix Naumann

Page 2: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 2

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 3: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 3

Database Management Systems vs. Integrated Information Systems

DBMS

IIS

Page 4: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 4

DBMS Quality vs. IIS Quality

Complete (assumed)

Accurate Trusted Fast Free

Incomplete Inaccurate Untrusted Slow Possible cost

High expectations High quality

Low expectations Low quality

Page 5: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 5

Datenqualität vs. Datenfehler

Qualität kann nicht einzig durch Data Cleansing erhöht werden. Ansehen, Objektivität, …

Accuracy ≠ Quality Duplicates ≠ Quality

Page 6: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 6

Datenqualität in IIS

Integrierte Informationssysteme besonders anfällig für Qualitätsprobleme Probleme akkumulieren

Qualität der Ursprungsdaten (Eingabe, Fremdfirmen,...)

Qualität der Quellsysteme (Konsistenz, Constraints, Fehler, ...)

Qualität der Integrationsprozesse Parsen, Transformieren Mappings

Probleme treten erst bei integrierter Sicht zu Tage

Page 7: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 7

Example: Customer Relationship Management (CRM)

Probleme im CRM eines Multi-Channel Vertriebs Kunden doppelt geführt Kunden falsch bewertet Falsche Adressen Haushalte / Konzernstrukturen nicht erkannt

Folgen „False positives“: Verärgerte Kunden durch mehrere /

unpassende Mailings „False negatives“: Verpasste Gelegenheiten durch

fehlende / falsche Zuordnung (Cross-Selling) Sinnlose Portokosten bei falschen Adressen

Quelle: Prof. Ulf Leser (VL Data Warehouses)

Page 8: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 8

Cost of Dirty Data

A.T. Kearny: 25%-40% der operativen Kosten entstehen durch schlechte Datenqualität.

Data Warehouse Institute: Industrie und Verwaltung in den USA verlieren jährlich 600 Milliarden USD.

SAS Studie: Nur 18% der Deutschen Betriebe vertrauen ihren Daten.

AT&T (70er): 20-30% aller Anschlüsse unbenutzt wegen schlechter Daten.

80% aller Krankenhaus Datensätze enthalten Fehler. Hmmm...

...

Page 9: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 9

Optimize IQ!

Fixed quality complete &

correct Optimize cost

time & throughput

Fixed cost price, patience, …

Optimize IQ IQ criteria

Page 10: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 10

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 11: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 11

Information Quality (IQ) Was ist Informationsqualität ?

„Fitness for use“ “User satisfaction” Anwendungsabhängig

Folgen geringer Datenqualität Falsche Prognosen Verpasstes Geschäft

Qualität ist besonders bei integrierten Informationen interessant Oft keine Kontrolle über Informationsquellen (Autonomie!) Oft zweifelhafte Qualität

Internet macht Publikation leicht Vielzahl verfügbarer Quellen

Page 12: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 12

IIS Quality Criteria

IQ :=

“Even though quality cannot be defined, you know what it is.”

Robert Pirsig

Page 13: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 13

Information Quality (IQ)

IQ := {Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ... }

Page 14: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 14

IQ Classification of [WS96]

Intrinsic IQ Believability, Accuracy, Objectivity, Reputation

Contextual IQ Value-added, Relevancy, Timeliness, Completeness,

Amount

Representational IQ Interpretability, Understandability, Repr. Consistency, Repr.

conciseness

Accessibility IQ Accessibility, Security

Page 15: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 15

Content-based IQ Criteria …concern the actual data. Accuracy

is the extent to which data is correct, reliable, and certified free of error. [WS96] Completeness

is the extent to which data is not missing and is of sufficient breadth, depth, and scope for the task at hand. [WS96]

Customer support is the amount and usefulness of human help via email or telephone.

Documentation is the amount and usefulness of documents with metadata.

Interpretability is the extent to which data is in appropriate languages, symbols, and units, and the definitions are

clear. [WS96] Relevancy (or relevance)

is the extent to which data is applicable and helpful for the task at hand. [WS96] Reliability

is the degree to which the user can trust the information Value-Added

is the extent to which data is beneficial and provides advantages from its use. [WS96]

Page 16: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 16

Technical IQ Criteria …concern software and hardware. Accessibility (or availability)

of a DBMS is the probability that a feasible query is correctly answered in a given time range.

Is the extent to which data are available or easily and quickly receivable [WS96]. Latency

is the amount of time in seconds from issuing the query until the first data item reaches the user

Price (cost effectiveness) is the amount of money a user has to pay for a query. is the extent to which the cost of collecting appropriate data is reasonable [WS96].

Response time measures the delay in seconds between submission of a query by the user and

reception of the complete response from the IS. Security

is the extent to which access to data is restricted appropriately to maintain its security [WS96].

Timeliness is the extent to which the age of the data is appropriate for the task at hand [WS96].

Page 17: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 17

Intellectual IQ Criteria

…concern subjective aspects. Believability

is the extent to which data is regarded as true, real, and credible [WS96].

Objectivity is the extent to which data is unbiased,

unprejudiced, and impartial [WS96]. Reputation

is the extent to which data is trusted or highly regarded in terms of its source or content [WS96].

Page 18: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 18

Instantiation-related IQ Criteria …concern the presentation of retrieved data. Amount of data

is the extent to which the quantity or volume of available data is appropriate [WS96].

Representational conciseness is the extent to which data is compactly represented without being

overwhelming [WS96]. Representational consistency

is the extent to which data is always represented in the same format and are compatible with previous data [WS96].

Understandability (ease of understanding) is the extent to which data are clear without ambiguity and easily

comprehended [WS96]. Verifiability (traceability)

Is the extent to which data are well documented, verifiable, and easily attributed to a source [WS96].

Page 19: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 19

IQ Criteria (classical) Accuracy

Definition: Usually: Percentage of incorrect tuples For integration: Percentage of incorrect data values

Assessment: Domain and Constraint Testing Lookup tables Scientific measurements Data-input experience

Improvement: Often: Deletion Better: “Data Scrubbing”

Page 20: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 20

IQ Criteria (classical) Response Time

Definition: Usually: Time until complete query result is received For integration: Latency

Assessment: “Cost Calibration” Continuous assessment

Improvement: Source selection Classical optimization Federated Optimization

Page 21: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 21

IQ Criteria (new)

Completeness Definition:

Coverage: Number of real world objects represented Density: Number of attributes covered For IIS: NULL-values

Assessment: Sampling Existing Metadata

Improvement: Source selection “Best k” vs. “k best”

Page 22: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 22

IQ Criteria (new)

Reputation / Trust Definition:

Reputation: Memory and summary of behavior from past transactions

Trust: Expectation about future behavior Assessment:

Individual experience Corporate guidance Trust-networks

Improvement: ???

Page 23: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 23

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 24: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 24

Optimize IQ!

Fixed quality complete &

correct Optimize cost

time & throughput

Fixed cost price, patience, …

Optimize IQ IQ criteria

Page 25: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 25

A New Optimization Paradigm – Many Changes

DBMS Cost criteria Cost model Optimization

algorithm

Integrates IS Quality criteria Quality model Optimization algorithm + Information

integration

Page 26: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 26

DB Cost Criteria

Response time Execution time Latency Throughput Cardinality …

Assessed through system parameters and statistics.

Page 27: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 27

IIS Quality Criteria

{Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ... }

IQ :=

Assessed in 3 classes…

Page 28: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 28

IQ-Assessment

Subjekt

Anfrage

Prozess Objekt

Vollständigkeit Zeitnähe ...

Verfügbarkeit Antwortzeit ...

Relevanz Glaubwürdigkeit ...

Page 29: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 29

IQ-Assessment

Information source: S1 S2 S3 Understandability 5 7 7 Reputation 5 5 7 Reliability 2 6 4 Age 30 30 2 Availability 99 99 60 Completeness 1 1 0.5 Response Time 0.2 0.2 0.2 Accuracy 99.9 99.9 99.8 Relevancy 60 80 90

Page 30: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 30

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 31: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 31

DB Cost Models

Operators + (add) max x (multiply)

Page 32: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 32

A Quality Model for Integrated IS

2 Problems Many Dimensions Multidimensional

S1(95,0,0.7,1,99.95,60,48.2,0)

S2(99,0,1,0.2,99.9,80,52.8,0)

S3(95,0,0.7,1,99.95,60,38,3)(?, ?, ?, ?, ?, ?, ?, ?)

(?, ?, ?, ?, ?, ?, ?, ?)⊓

aggregated IQ-vector

IQ-vector

Mergingoperators

Page 33: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 33

IQ Merge FunctionsAvailability: A BPrice: A + BResponse Time: max[A, B]Coverage: Sylvester

Merge IQ in many Dimensions

(94.05,0,1,1,99.85,48,54.86,0)

(89.35,0,1,1,99.8,28.8,76.06,3)

S1(95,0,0.7,1,99.95,60,48.2,0)

S2(99,0,1,0.2,99.9,80,52.8,0)

S3

merge

merge

(95,0,0.7,1,99.95,60,38,3)

Page 34: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 34

Multidimensional IQ

IQ-criteria have Different units Different ranges Different importance

So... convert scale weight

(89.35,0,1,1,99.8,28.8,76.06,3) > (82.35,0,2,1.5,95,32,71.77,2) ?

MADM methods:SAW, TOPSIS, ELECTRE, AHP, DEA

Page 35: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 35

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 36: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 36

DB-type Optimization

Goal Minimize response time Maximize throughput

Restrictions Complete Correct (not just accurate: filter conditions…)

Find best plan!

Page 37: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 37

DB-type Optimization

(name, company)

emp comp

⋈(compID=ID)

(salary > 1000)

(name, company)

compemp

⋈(compID=ID)

(salary > 1000)

SELECT name, company FROM emp, comp WHERE emp.compID = comp.ID AND emp.salary > 1000

Page 38: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 38

DB-type Optimization

(name, company)

compemp_1

⋈(compID=ID)

(salary > 1000)

emp_n...

Page 39: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 39

DB-type Optimization

(name, company)

comp

emp_1

⋈(compID=ID)

(salary > 1000)

emp_n...

MERGE

(name, company)

comp

emp_1

⋈(compID=ID)

(salary > 1000)

emp_n...

MERGE

(salary > 1000)

Page 40: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 40

IIS-type Optimization

Change is efficient But: Result can be

incomplete.

Preferences?

(name, company)

comp

emp_1

⋈(compID=ID)

(salary > 1000)

emp_n...

MERGE

(salary > 1000)

Page 41: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 41

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 42: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 42

IIS-type Optimization

Goal Maximize information quality (Maximize completeness)

Restrictions Price Bandwidth Time (user patience)

Find K best sources – Find best K sources

Page 43: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 43

IIS-type Optimization

K best sources Simple IQ model, but Sources may not complement each other

at tuple level (replication)at attribute level

Best K sources Finds optimal query result Uses IQ merging

Page 44: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 44

Naive (3-phase) approach [NLF99]

Input

Query,views,IQ scores

Phase 1

Source selection

Phase 3

Plan selection

Output

Quality-ranked plans

Phase 2Query planning

Good: executes only best plans Bad: still needs to compute all plans

BA

Page 45: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 45

Integrated (single-phase) approach [LN00]

Input

Query, views,and IQ scores

Output

Quality-ranked plans

HiQ B&B: one phase, quality-based branch & bound algorithm

Good: executes only best plans Good: computes only a fraction of all plans

Page 46: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 46

Overview

Motivation: IQ for integrated IS Definition of IQ Optimizing IQ

IQ assessment IQ model IQ query answering in DBMS IQ query answering in IIS IQ-driven integration

Page 47: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 47

Conflict Resolution

amazon.comamazon.com

bn.combn.com

IDID max lengthmax length MINMIN CONCATCONCAT

$5.99Moby DickHerman Melville0766607194

$3.98H. Melville0766607194

These are IQ considerations!

These are IQ considerations!

These are IQ considerations!

Page 48: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 48

Conflict Resolution

null = unknown

Internal Conflict-Resolution Function

Page 49: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 49

Conflict Resolution

Numerical:SUM, AVG, MAX, MIN, …

Non-numerical:MAXLENGTH, CONCAT, AnnCONCAT,…

Special:RANDOM, COUNT, CHOOSE, FAVOR, MaxIQ,…

Domain-specific…

Human speci-fication

of IQ

Automated speci-fication

of IQ

Page 50: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 50

Conflict Resolution

amazon.comamazon.com

bn.combn.com

IDID max lengthmax length MINMIN CONCATCONCAT

$5.99Moby DickHerman Melville0766607194

$3.98H. Melville0766607194

slow, up-date, complete, …

fast, outdated, incomplete, …

Page 51: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 51

MaxIQ Resolution Function

Per attribute Choose value of higher quality source.

Per tuple Choose tuple of higher quality source.

Per source Choose K best sources Choose best K sources

Page 52: Informationsintegration Information Quality

26.1.2006 Felix Naumann, VL Informationsintegration, WS 05/06 52

Literatur [RD00] Data Cleaning: Problems and Current Approaches, Rahm

& Do,  IEEE Bulletin 23(4), 2000. [MF03] Problems, Methods, and Challenges in Comprehensive

Data Cleansing. Heiko Müller, Johann-Christoph Freytag, Technical Report HUB-IB-164, Humboldt University Berlin, 2003

[WS96] Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. Journal on Management of Information Systems, 12(4):5-34, 1996.

[NLF99] Felix Naumann, Ulf Leser, and Johann-Christoph Freytag: Quality-driven Integration of Heterogenous Information Systems, VLDB 1999

[LN00] Ulf Leser and Felix Naumann: Query Planning with Information Quality Bounds, FQAS 2000.