Upload
pete
View
43
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Supporting Imprecision in Multidimensional Databases Using Granularities. T. B. Pedersen 1,2 , C. S. Jensen 2 , and C. E. Dyreson 2 1 Center for Health Information Services, Kommunedata, www.kmd.dk 2 Nykredit Center for Database Research, Department of Computer Science, - PowerPoint PPT Presentation
Citation preview
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Supporting Imprecision in Multidimensional Databases
Using Granularities
T. B. Pedersen1,2, C. S. Jensen2, and C. E. Dyreson2
1 Center for Health Information Services, Kommunedata, www.kmd.dk
2 Nykredit Center for Database Research,
Department of Computer Science,
Aalborg University, www.cs.auc.dk/NDB
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 2
Talk Overview
• Motivation• Data model and query language context• Handling imprecision• Alternative queries• Imprecision in grouping• Imprecision in computations• Presenting imprecise results• Using pre-aggregated data• Conclusion
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 3
Motivation
• Online Analytical Processing (OLAP) tools are increasingly used in many different areas:
Business applications Medical applications Other scientific applications
• Data “imperfection” is a problem for medical and other applications.
Some data is missing. Some data has varying degrees of imprecision.
• Current OLAP tools assume that data imperfections are handled during the “data cleansing” process.
Not realistic for most cases Introduces mapping errors Hides the “true quality” of the data
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 4
Previous Work
• Imprecision versus uncertainty Imprecision is a property of the content of an attribute. Uncertainty concerns the degree of truth of a statement. We handle only imprecision. Our focus is on imprecision in aggregate queries.
• Most work on imprecision deals with relational databases Fuzzy sets - specifies a degree of set membership for a value Partial values - one of a set of values is the true value Multiple imputation - substitute multiple values for a missing value High computational complexity
• Only “incomplete datacubes” have treated imprecision in multidimensional databases.
Imprecision fixed at schema level Imprecision in computations not handled
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 5
A Diabetes Case Study
• E/R schema of case study• Patients have a diagnosis.• Diagnoses may be missing.• Diagnoses may be specified at
the Low-level Diagnosis or Diagnosis Group level.
• HbA1c% (long-term blood sugar level) measured with several methods of varying precision.
DiagnosisDiagnosis
FamilyIs part of
(1,n)(1,1)
* Code* Text
Patient
* Name* SSN* HbA1c%* Precision
(0,1)Has
Low-levelDiagnosis
Diagnosis
D
(0,n)
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 6
Data Model - Schema• Fact type: Patient• Dimension types: Diagnosis
and HbA1c%• There are no “measures”, all
data are dimensions.• Category types: Low-level
Diagnosis, Diagnosis Group, Precise, Imprecise
• Top category types: corresponds to ALL of the dimension.
• Bottom category types: the lowest level in each dimension
• The category types of a dimension type form a lattice.
• Category types ~ granularities Patient
Diagnosis Group
LL Diagnosis
Diagnosis Dimension
Type
HbA1c% Dimension
Type
TDiagnosis THbA1c%
Precise
Imprecise
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 7
Data Model - Instances• Categories: instances of
category types, consist of dimension values.
• Top categories contain only one “T” value.
• Dimensions = categories + partial order on category values
• Facts: instances of fact type with separate identity.
• Fact-dimension relations: links facts to dimensions, may map to values of any granularity.
• Multidimensional object (MO) = schema + dimension + facts + fact-dimension relations
Diagnosis Dimension
HbA1c% Dimension
T
Diabetes
ID Diabetes
NID Diabetes
T
Jim John Jane
5 6 7
5.4 5.5 .. 7.0
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 8
Algebraic Query Language• Close to relational algebra with
aggregation functions Selection, projection, rename,
union, difference, identity-based join operators
• Aggregation operator: takes “grouping categories”, aggregation function and result dimension as arguments.
Groups together facts characterized by the same dimension values, applies aggregation function to groups, and places result in result dimension.
Example: COUNT by Diagnosis Family and THbA1c%
Diagnosis Dimension
Result Dimension
T T
Diabetes
{Jim,John,Jane}
0-2 >2
0 1 2 3 ..
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 9
Overview
• Motivation• Data model and query language context• Handling imprecision• Alternative queries• Imprecision in grouping• Imprecision in computations• Presenting imprecise results• Using pre-aggregated data• Conclusion
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 10
Handling Imprecision• We use the granularity of the
data to capture imprecision.• The dimensions specify an
imprecision hierarchy.• Data is mapped to the
appropriate granularity.• Pseudo-code:
Procedure EvalImprecise(Q,M) if PreciseEnough(Q,M) then Eval(Q,M) else Q’=Alternative(Q,M) if Q’ is ok then Eval(Q’,M) else
Handle Imprecision in Grouping Handle Imprecision in Computation
Return Imprecise Result end if end if
Most Imprecise=T=Unknown
More Imprecise
More precise
Most precise=bottom category
.
.
.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 11
Alternative Queries• We use a “Precision MO” (PMO)
to capture the data granularities. • The precision MO has a
granularity dimension for every MO dimension D.
• A granularity dimension has two categories: T and GranularityD.
• GranularityD contains a value for every category type in D.
• The set of facts stays the same.• Facts are mapped to the
appropriate GranularityD value.
GranDiagnosis Dimension
GranHbA1c% Dimension
T T
DGLLD TD P I TH
Jim John Jane
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 12
Alternative Queries• Queries are rewritten into
“testing queries” on the PMO that counts the number of facts mapped to different granularity combinations (details in paper).
• The result of the testing queries can be used to suggest alternative queries.
• Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO
Testing query: COUNT by GranularityDiagnosis on PMO
Result shows that data is not precise enough to group on Low-level Diagnosis
Alternative query: COUNT by Diagnosis Group
GranDiagnosis Dimension
Result Dimension
T
LLD DG TD 0 1 2
T
{Jim} {John,Jane}
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 13
Overview
• Motivation• Data model and query language context• Handling imprecision• Alternative queries• Imprecision in grouping• Imprecision in computations• Presenting imprecise results• Using pre-aggregated data• Conclusion
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 14
Imprecision in Grouping
• If facts are mapped to a dimension value of coarser granularity than the grouping category, with which dimension value should they be grouped ? - we do not know !
• So, we return several answers, based on different groupings of facts:
Conservative grouping: only include in a group what is known to belong to it - discard imprecise data.
Liberal grouping: include in a group everything that might belong to it - facts may be in several groups.
Weighted grouping: include everything that might belong, but give more weight to likely members.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 15
Conservative Grouping• Corresponds to the standard
aggregation operator.• Imprecise data are simply
discarded.• Example: COUNT by Low-level
Diagnosis (and THbA1c%)) on MO Jim does not show in any group. The count for both groups is 1. The result is “too conservative”
as not all data is accounted for in the result.
Diagnosis Dimension
Result Dimension
T T
Diabetes
ID Diabetes
NID Diabetes 0 1 2
{John} {Jane}
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 16
Liberal Grouping• We modify the aggregation
operator from the query language to compute the liberal grouping (formal semantics in the paper).
• The computed groups may now overlap, i.e., the same fact may be in several groups.
• Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO
Jim ends up up both groups. The count for both groups is 2. The result is “too liberal” as the
same data may be counted several times in the result.
Diagnosis Dimension
Result Dimension
T
0 1 2
Diabetes
T
ID Diabetes
NID Diabetes
{Jim,John} {Jim,Jane}
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 17
Weighted Grouping• Compromise between the
conservative and liberal groupings:
Weights assigned to the partial order on dimension values and to fact membership in groups (groups become fuzzy sets).
Aggregation operator modified to compute the weighted grouping (see paper).
• Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO
80% of Diabetes patient have ID Diabetes, 20% NID.
Jim ends up up both groups, but weighted differently.
Result is a weighted COUNT (details on next slide).
Diagnosis Dimension
T
Diabetes
ID Diabetes
NID Diabetes
1.0
.8 .2
{Jim.8,John1.0} {Jim.2,Jane1.0}
Result Dimension
T
0 1.2 1.8
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 18
Imprecision in Computation
• We also need to handle imprecision in the aggregate computation itself, e.g., the varying precision for HbA1c%.
• In computations, we impute the expected value for values of non-bottom granularity (generalized imputation). This allows normal computation of the result.
• A precision computation is performed along with the aggregate computation:
A granularity computation measure (GCM) is used to capture the imprecision of a dimension value during computation.
A measure combination function (MCF) is used to combine GCMs. The MCF must be distributive.
A final granularity measure (FGM) represent the “true” imprecision of a result value.
A final granularity function (FGF) maps from a GCM to a FGM.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 19
Imprecision in Computation• Example: WAVG(HbA1c%) by
Low-Level Diagnosis We use FGM=WAVG(Level) as
a measure of precision (Precise values have level 0, etc.)
The GCM for a value e with weight w is: (w * Level(e),w)
The MCF is: h((n1, n2),(n3,n4)) = (n1+n3,n2+n4)
The FGF is: f(n1,n2) = n1/n2 6.0 & 7.0 imputed for T & 7 in
WAVG(HbA1c%) computation Jim, John, and Jane have
levels 2, 0, and 1, respectively Results of query: {(IDD,5.7,.9),
(NIDD,5.6,1.2)}
Diagnosis Dimension
HbA1c% Dimension
T T
Diabetes
ID Diabetes
NID Diabetes
Jim John Jane
5 6 7
5.4 5.5 .. 7.0
6.0
7.0
.8 .2
1.0
1
2
0
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 20
Presenting Imprecise Results• Several ways to present the
imprecise results to the user:• Show result values along with
the corresponding final granularity measure values.
Very precise estimate of result precision, but hard to grasp.
Resulting (value,FGM) = {(IDD,5.7,.9),(NIDD,5.6,1.2)}
• Map result values into different granularities using a value coarsening function (VCF).
More intuitive result, but less precise estimate of precision.
Example: Weighted grouping with VCF: r(x) = v such that xv and Level(v) = Ceiling(x).
Diagnosis Dimension
Result Dimension
T
Diabetes
ID Diabetes
NID Diabetes
{Jim.8,John1.0} {Jim.2,Jane1.0}
T
5.4 5.5 .. 7.0
5 6 7
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 21
Using Pre-Aggregated Data
• The approach can utilize pre-aggregated data effectively.• Aggregate results for the precision MO can usually be fully
materialized due to the relatively small multidimensional space (~ 1.000.000 cells).
• The aggregate computation with expected values can use standard pre-aggregation techniques, e.g., partial pre-aggregation, for good response-time versus storage/update-time tradeoffs.
• The distributive MCF allows for partial pre-aggregation of precision results.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999 22
Conclusion
• Current OLAP tools and models do not support the imprecision found in real-world data.
• We have shown an approach to handling imprecision in OLAP databases based on the common multidimensional concept of granularities.
• The approach can suggest alternative queries when data is not precise enough and handles imprecision both in the grouping of data and in the aggregate computation.
• The approach has a low computational overhead and is able to exploit pre-aggregated results effectively.
• The approach can be implemented using existing technology such as SQL and OLAP tools.
• Future work: presentation of results, MIN/MAX functions, etc.