Supporting Imprecision in Multidimensional Databases Using Granularities

11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Supporting Imprecision in Multidimensional Databases

Using Granularities

T. B. Pedersen1,2, C. S. Jensen2, and C. E. Dyreson2

1 Center for Health Information Services, Kommunedata, www.kmd.dk

2 Nykredit Center for Database Research,

Department of Computer Science,

Aalborg University, www.cs.auc.dk/NDB

11th SSDBM, Cleveland, Ohio, July 28-30, 1999 2

Talk Overview

• Motivation• Data model and query language context• Handling imprecision• Alternative queries• Imprecision in grouping• Imprecision in computations• Presenting imprecise results• Using pre-aggregated data• Conclusion


Motivation

• Online Analytical Processing (OLAP) tools are increasingly used in many different areas:

Business applications Medical applications Other scientific applications

• Data “imperfection” is a problem for medical and other applications.

Some data is missing. Some data has varying degrees of imprecision.

• Current OLAP tools assume that data imperfections are handled during the “data cleansing” process.

Not realistic for most cases Introduces mapping errors Hides the “true quality” of the data


Previous Work

• Imprecision versus uncertainty Imprecision is a property of the content of an attribute. Uncertainty concerns the degree of truth of a statement. We handle only imprecision. Our focus is on imprecision in aggregate queries.

• Most work on imprecision deals with relational databases Fuzzy sets - specifies a degree of set membership for a value Partial values - one of a set of values is the true value Multiple imputation - substitute multiple values for a missing value High computational complexity

• Only “incomplete datacubes” have treated imprecision in multidimensional databases.

Imprecision fixed at schema level Imprecision in computations not handled


A Diabetes Case Study

• E/R schema of case study• Patients have a diagnosis.• Diagnoses may be missing.• Diagnoses may be specified at

the Low-level Diagnosis or Diagnosis Group level.

• HbA1c% (long-term blood sugar level) measured with several methods of varying precision.

DiagnosisDiagnosis

FamilyIs part of

(1,n)(1,1)

* Code* Text

Patient

* Name* SSN* HbA1c%* Precision

(0,1)Has

Low-levelDiagnosis

Diagnosis

D

(0,n)


Data Model - Schema• Fact type: Patient• Dimension types: Diagnosis

and HbA1c%• There are no “measures”, all

data are dimensions.• Category types: Low-level

Diagnosis, Diagnosis Group, Precise, Imprecise

• Top category types: corresponds to ALL of the dimension.

• Bottom category types: the lowest level in each dimension

• The category types of a dimension type form a lattice.

• Category types ~ granularities Patient

Diagnosis Group

LL Diagnosis

Diagnosis Dimension

Type

HbA1c% Dimension

Type

TDiagnosis THbA1c%

Precise

Imprecise


Data Model - Instances• Categories: instances of

category types, consist of dimension values.

• Top categories contain only one “T” value.

• Dimensions = categories + partial order on category values

• Facts: instances of fact type with separate identity.

• Fact-dimension relations: links facts to dimensions, may map to values of any granularity.

• Multidimensional object (MO) = schema + dimension + facts + fact-dimension relations

Diagnosis Dimension

HbA1c% Dimension

T

Diabetes

ID Diabetes

NID Diabetes

T

Jim John Jane

5 6 7

5.4 5.5 .. 7.0


Algebraic Query Language• Close to relational algebra with

aggregation functions Selection, projection, rename,

union, difference, identity-based join operators

• Aggregation operator: takes “grouping categories”, aggregation function and result dimension as arguments.

Groups together facts characterized by the same dimension values, applies aggregation function to groups, and places result in result dimension.

Example: COUNT by Diagnosis Family and THbA1c%

Diagnosis Dimension

Result Dimension

T T

Diabetes

{Jim,John,Jane}

0-2 >2

0 1 2 3 ..


Overview



Handling Imprecision• We use the granularity of the

data to capture imprecision.• The dimensions specify an

imprecision hierarchy.• Data is mapped to the

appropriate granularity.• Pseudo-code:

Procedure EvalImprecise(Q,M) if PreciseEnough(Q,M) then Eval(Q,M) else Q’=Alternative(Q,M) if Q’ is ok then Eval(Q’,M) else

Handle Imprecision in Grouping Handle Imprecision in Computation

Return Imprecise Result end if end if

Most Imprecise=T=Unknown

More Imprecise

More precise

Most precise=bottom category

.

.

.


Alternative Queries• We use a “Precision MO” (PMO)

to capture the data granularities. • The precision MO has a

granularity dimension for every MO dimension D.

• A granularity dimension has two categories: T and GranularityD.

• GranularityD contains a value for every category type in D.

• The set of facts stays the same.• Facts are mapped to the

appropriate GranularityD value.

GranDiagnosis Dimension

GranHbA1c% Dimension

T T

DGLLD TD P I TH

Jim John Jane


Alternative Queries• Queries are rewritten into

“testing queries” on the PMO that counts the number of facts mapped to different granularity combinations (details in paper).

• The result of the testing queries can be used to suggest alternative queries.

• Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO

Testing query: COUNT by GranularityDiagnosis on PMO

Result shows that data is not precise enough to group on Low-level Diagnosis

Alternative query: COUNT by Diagnosis Group

GranDiagnosis Dimension

Result Dimension

T

LLD DG TD 0 1 2

T

{Jim} {John,Jane}


Overview



Imprecision in Grouping

• If facts are mapped to a dimension value of coarser granularity than the grouping category, with which dimension value should they be grouped ? - we do not know !

• So, we return several answers, based on different groupings of facts:

Conservative grouping: only include in a group what is known to belong to it - discard imprecise data.

Liberal grouping: include in a group everything that might belong to it - facts may be in several groups.

Weighted grouping: include everything that might belong, but give more weight to likely members.


Conservative Grouping• Corresponds to the standard

aggregation operator.• Imprecise data are simply

discarded.• Example: COUNT by Low-level

Diagnosis (and THbA1c%)) on MO Jim does not show in any group. The count for both groups is 1. The result is “too conservative”

as not all data is accounted for in the result.

Diagnosis Dimension

Result Dimension

T T

Diabetes

ID Diabetes

NID Diabetes 0 1 2

{John} {Jane}


Liberal Grouping• We modify the aggregation

operator from the query language to compute the liberal grouping (formal semantics in the paper).

• The computed groups may now overlap, i.e., the same fact may be in several groups.


Jim ends up up both groups. The count for both groups is 2. The result is “too liberal” as the

same data may be counted several times in the result.

Diagnosis Dimension

Result Dimension

T

0 1 2

Diabetes

T

ID Diabetes

NID Diabetes

{Jim,John} {Jim,Jane}


Weighted Grouping• Compromise between the

conservative and liberal groupings:

Weights assigned to the partial order on dimension values and to fact membership in groups (groups become fuzzy sets).

Aggregation operator modified to compute the weighted grouping (see paper).


80% of Diabetes patient have ID Diabetes, 20% NID.

Jim ends up up both groups, but weighted differently.

Result is a weighted COUNT (details on next slide).

Diagnosis Dimension

T

Diabetes

ID Diabetes

NID Diabetes

1.0

.8 .2

{Jim.8,John1.0} {Jim.2,Jane1.0}

Result Dimension

T

0 1.2 1.8


Imprecision in Computation

• We also need to handle imprecision in the aggregate computation itself, e.g., the varying precision for HbA1c%.

• In computations, we impute the expected value for values of non-bottom granularity (generalized imputation). This allows normal computation of the result.

• A precision computation is performed along with the aggregate computation:

A granularity computation measure (GCM) is used to capture the imprecision of a dimension value during computation.

A measure combination function (MCF) is used to combine GCMs. The MCF must be distributive.

A final granularity measure (FGM) represent the “true” imprecision of a result value.

A final granularity function (FGF) maps from a GCM to a FGM.


Imprecision in Computation• Example: WAVG(HbA1c%) by

Low-Level Diagnosis We use FGM=WAVG(Level) as

a measure of precision (Precise values have level 0, etc.)

The GCM for a value e with weight w is: (w * Level(e),w)

The MCF is: h((n1, n2),(n3,n4)) = (n1+n3,n2+n4)

The FGF is: f(n1,n2) = n1/n2 6.0 & 7.0 imputed for T & 7 in

WAVG(HbA1c%) computation Jim, John, and Jane have

levels 2, 0, and 1, respectively Results of query: {(IDD,5.7,.9),

(NIDD,5.6,1.2)}

Diagnosis Dimension

HbA1c% Dimension

T T

Diabetes

ID Diabetes

NID Diabetes

Jim John Jane

5 6 7

5.4 5.5 .. 7.0

6.0

7.0

.8 .2

1.0

1

2

0


Presenting Imprecise Results• Several ways to present the

imprecise results to the user:• Show result values along with

the corresponding final granularity measure values.

Very precise estimate of result precision, but hard to grasp.

Resulting (value,FGM) = {(IDD,5.7,.9),(NIDD,5.6,1.2)}

• Map result values into different granularities using a value coarsening function (VCF).

More intuitive result, but less precise estimate of precision.

Example: Weighted grouping with VCF: r(x) = v such that xv and Level(v) = Ceiling(x).

Diagnosis Dimension

Result Dimension

T

Diabetes

ID Diabetes

NID Diabetes

{Jim.8,John1.0} {Jim.2,Jane1.0}

T

5.4 5.5 .. 7.0

5 6 7


Using Pre-Aggregated Data

• The approach can utilize pre-aggregated data effectively.• Aggregate results for the precision MO can usually be fully

materialized due to the relatively small multidimensional space (~ 1.000.000 cells).

• The aggregate computation with expected values can use standard pre-aggregation techniques, e.g., partial pre-aggregation, for good response-time versus storage/update-time tradeoffs.

• The distributive MCF allows for partial pre-aggregation of precision results.


Conclusion

• Current OLAP tools and models do not support the imprecision found in real-world data.

• We have shown an approach to handling imprecision in OLAP databases based on the common multidimensional concept of granularities.

• The approach can suggest alternative queries when data is not precise enough and handles imprecision both in the grouping of data and in the aggregate computation.

• The approach has a low computational overhead and is able to exploit pre-aggregated results effectively.

• The approach can be implemented using existing technology such as SQL and OLAP tools.

• Future work: presentation of results, MIN/MAX functions, etc.

Documents

Supporting Imprecision in Multidimensional Databases Using Granularities