Data Modelling and Cleaning
CMPT 455/826 - Week 8, Day 2
Sept-Dec 2009 – w8d2 1
Conceptual Modelling Solutions for the Data Warehouse
Stefano Rizzi
Definition 1: Facts
• A fact is a focus of interest – for the decision-making process;
• Typically, it models a set of events – occurring in the enterprise world.
• A fact is graphically represented – by a box with two sections: one for the fact name and one for the measures.
Guideline 1: Facts
• Concepts represented in the data source – by frequently-updated archives
• are good candidates for facts
• Concepts represented – by almost-static archives
• are not good candidates for facts
Definition 2: Measure
• A measure is a numerical property of a fact, – and describes one of its quantitative aspects of interest for analysis.
• Measures are included in the bottom section of the fact.
Definition 3: Dimension
• A dimension is a fact property with a finite domain and – describes one of its analysis coordinates.
• The set of dimensions of a fact – determines its finest representation granularity.
• Graphically, dimensions are represented – as circles attached to the fact by straight lines.
Guideline 2: Dimensions
• At least one of the dimensions of the fact – should represent time, at any granularity.
Definition 4: Primary Event
• A primary event is an occurrence of a fact, and – is identified by a tuple of values, – one value for each dimension.
• Each primary event is described – by one value for each measure.
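The definitions so far can be sketched in a few lines of Python. This is only an illustration: the SALE fact, its dimensions (date, product, store), its measures (quantity, revenue) and all values are hypothetical and do not come from the paper.

```python
from typing import Dict, Tuple

# Hypothetical SALE fact: dimension and measure names are assumptions.
DIMENSIONS = ("date", "product", "store")
MEASURES = ("quantity", "revenue")

# Each primary event is identified by a tuple with one value per dimension
# and is described by one value per measure.
primary_events: Dict[Tuple[str, str, str], Dict[str, float]] = {
    ("2009-10-27", "milk", "Saskatoon"): {"quantity": 3, "revenue": 7.5},
    ("2009-10-27", "bread", "Saskatoon"): {"quantity": 2, "revenue": 5.0},
    ("2009-10-28", "milk", "Regina"): {"quantity": 1, "revenue": 2.5},
}


def is_well_formed(key, values) -> bool:
    """True if the event has one value per dimension and per measure."""
    return len(key) == len(DIMENSIONS) and set(values) == set(MEASURES)


all_ok = all(is_well_formed(k, v) for k, v in primary_events.items())
```

Using the dimension tuple as a dictionary key makes the identification rule explicit: two primary events cannot share the same combination of dimension values.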
Definition 5: Dimension Attributes
• A dimension attribute is a property, – with a finite domain, – of a dimension.
• Like dimensions, – it is represented by a circle.
Definition 6: Hierarchy
• A hierarchy is a directed tree, – rooted in a dimension, – whose nodes are all the dimension attributes that describe that dimension, – and whose arcs model many-to-one associations between pairs of dimension attributes.
• Arcs are graphically represented by straight lines.
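A minimal sketch of a hierarchy as a tree of many-to-one associations, assuming a hypothetical store dimension with attributes city, state, country and manager. All names and values are illustrative only.

```python
# For each dimension attribute, the finer attribute that determines it.
# "store" is the root (the dimension), so it has no entry here.
FINER = {"city": "store", "manager": "store", "state": "city", "country": "state"}

# Instance level: each arc is many-to-one, so a plain dict is enough.
MAPPING = {
    "city": {"S1": "Saskatoon", "S2": "Saskatoon", "S3": "Regina"},
    "manager": {"S1": "Ann", "S2": "Bob", "S3": "Ann"},
    "state": {"Saskatoon": "SK", "Regina": "SK"},
    "country": {"SK": "Canada"},
}


def path_from_root(attr):
    """Attributes on the path from the dimension down to attr, root first."""
    path = [attr]
    while path[-1] in FINER:
        path.append(FINER[path[-1]])
    return list(reversed(path))


def roll_up(store_value, to_attr):
    """Follow the many-to-one arcs from a store value up to to_attr."""
    value = store_value
    for attr in path_from_root(to_attr)[1:]:
        value = MAPPING[attr][value]
    return value
```

For example, `roll_up("S1", "country")` walks store, city, state, country. Because every attribute has exactly one finer attribute, the structure is a tree rooted in the dimension.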
Definition 8: Descriptive attribute
• A descriptive attribute specifies a property of a dimension attribute, – to which it is related by an x-to-one association.
• Descriptive attributes are not used for aggregation; – they are always leaves of their hierarchy – and are graphically represented by horizontal lines.
Definition 9: Cross-dimension attributes
• A cross-dimension attribute – is a dimension or descriptive attribute – whose value is determined – by the combination of two or more dimension attributes, – possibly belonging to different hierarchies.
• It is denoted by connecting with a curved line – the arcs that determine it.
Definition 10: Convergence
• A convergence takes place – when two dimension attributes within a hierarchy – are connected by two or more alternative paths – of many-to-one associations.
• Convergences are represented – by letting two or more arcs converge – on the same dimension attribute.
Definition 13: Ragged Hierarchy
• A ragged (or incomplete) hierarchy is a hierarchy, – where, for some instances, – the values of one or more attributes are missing – (since undefined or unknown).
• A ragged hierarchy is graphically denoted – by marking with a dash the attributes – whose values may be missing.
Definition 14: Unbalanced Hierarchy
• An unbalanced (or recursive) hierarchy is a hierarchy – where, though inter-attribute relationships are consistent, – the instances may have different lengths.
• Graphically, it is represented – by introducing a cycle within the hierarchy.
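One common case of an unbalanced hierarchy is a recursive "reports to" association. The sketch below assumes a hypothetical employee dimension. It shows that the attribute relationship is the same for every instance, while the chains have different lengths.

```python
# Hypothetical employee dimension: each employee reports to at most one other.
REPORTS_TO = {"ann": None, "bob": "ann", "cy": "bob", "dee": "ann"}


def chain_to_top(employee):
    """Follow the recursive association until the top of the hierarchy."""
    chain = [employee]
    while REPORTS_TO[chain[-1]] is not None:
        chain.append(REPORTS_TO[chain[-1]])
    return chain


# The same association yields instance paths of different lengths.
lengths = {e: len(chain_to_top(e)) for e in REPORTS_TO}
```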
Definition 15: Additive
• A measure is said to be additive along a dimension – if its values can be aggregated – along the corresponding hierarchy by the sum operator, – otherwise it is called nonadditive.
• A nonadditive measure is nonaggregable – if no other aggregation operator can be used on it.
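A small sketch of additivity, assuming a hypothetical inventory fact with a stock-level measure and two dimensions (month, store). The values are made up. Stock level can be summed along the store dimension, but summing it along time is not meaningful, so a different operator (here the average) is used.

```python
from collections import defaultdict

# (month, store) -> stock level; hypothetical inventory snapshots.
stock = {
    ("2009-09", "S1"): 10, ("2009-09", "S2"): 20,
    ("2009-10", "S1"): 12, ("2009-10", "S2"): 18,
}


def aggregate(events, keep_index, op):
    """Group events by the dimension at keep_index and apply op to each group."""
    groups = defaultdict(list)
    for key, value in events.items():
        groups[key[keep_index]].append(value)
    return {k: op(v) for k, v in groups.items()}


# Additive along store: the sum over stores is the total stock in a month.
total_per_month = aggregate(stock, 0, sum)

# Nonadditive along time: adding monthly snapshots has no meaning,
# so aggregate with the average instead.
avg_per_store = aggregate(stock, 1, lambda v: sum(v) / len(v))
```

Because another operator still applies along time, the measure is nonadditive along that dimension but not nonaggregable.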
Open Issues
• Lack of a standard for conceptual models
• Need for design patterns to support modelling
• Need for a method to model security issues
Data Cleaning
(Based on Rahm)
Single source problems
• Lack of appropriate model-specific integrity constraints
– Attribute: illegal values
– Record: uniqueness violation
– Relationship: referential integrity not validated
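The three problem classes above can be checked with simple functions. The sketch below is only an illustration: the employee and department records, the column names and the legal age range are all assumptions.

```python
def illegal_values(rows, column, is_legal):
    """Attribute level: rows whose value fails a domain check."""
    return [r for r in rows if not is_legal(r[column])]


def uniqueness_violations(rows, key):
    """Record level: key values that occur more than once."""
    seen, repeated = set(), set()
    for r in rows:
        (repeated if r[key] in seen else seen).add(r[key])
    return repeated


def dangling_references(rows, fk, parent_rows, pk):
    """Relationship level: rows whose foreign key matches no parent row."""
    valid = {p[pk] for p in parent_rows}
    return [r for r in rows if r[fk] not in valid]


# Hypothetical data containing one instance of each problem.
departments = [{"dept_id": 1}, {"dept_id": 2}]
employees = [
    {"emp_id": 1, "age": 34, "dept_id": 1},
    {"emp_id": 2, "age": -5, "dept_id": 2},  # illegal value
    {"emp_id": 2, "age": 41, "dept_id": 1},  # duplicate key
    {"emp_id": 3, "age": 29, "dept_id": 9},  # dangling reference
]

bad_ages = illegal_values(employees, "age", lambda a: 0 <= a <= 120)
dup_ids = uniqueness_violations(employees, "emp_id")
orphans = dangling_references(employees, "dept_id", departments, "dept_id")
```

In a relational system these checks would normally be declared as CHECK, UNIQUE and FOREIGN KEY constraints; the functions only show what goes undetected when such constraints are missing.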
Single source problems
• Lack of appropriate application-specific integrity constraints can lead to:
– Attribute problems: missing values, misspellings, cryptic abbreviations, embedded values, misfielded values
– Record problems: violated attribute dependencies, word transpositions, duplicated records, contradicting records
– Relationship problems: wrong references
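Two of the record problems can be illustrated briefly. The sketch assumes a hypothetical dependency from postal-code prefix to city and a deliberately simple name normalisation; real matching would need far more care.

```python
def name_key(name):
    """Order-insensitive key, so 'Smith, John' and 'John Smith' compare equal."""
    tokens = name.replace(",", " ").replace(".", " ").lower().split()
    return tuple(sorted(tokens))


# Hypothetical reference data for the dependency zip -> city.
ZIP_TO_CITY = {"S7N": "Saskatoon", "S4P": "Regina"}


def violates_zip_city(record):
    """True if the city contradicts the one implied by the zip prefix."""
    expected = ZIP_TO_CITY.get(record["zip"])
    return expected is not None and expected != record["city"]


records = [
    {"name": "John Smith", "zip": "S7N", "city": "Saskatoon"},
    {"name": "Smith, John", "zip": "S7N", "city": "Regina"},
]

# Word transposition: the two names refer to the same tokens.
same_person = name_key(records[0]["name"]) == name_key(records[1]["name"])
# Violated attribute dependency: zip and city disagree.
bad = [r["name"] for r in records if violates_zip_city(r)]
```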
Multi-source Problems
• In addition to single source problems, there can be:
– overlapping or contradicting data
– schema naming and structural conflicts
– different data types / granularities / interpretations / points in time
Data Analysis for cleaning
• Using metadata for data profiling
– focuses on the instance analysis of individual attributes
– derives information such as the data type, length, value range, discrete values and their frequency, variance, uniqueness, occurrence of null values, typical string pattern (e.g., for phone numbers)
– providing an exact view of various quality aspects of the attribute
• Data mining
– helps discover specific data patterns in large data sets, e.g., relationships holding between several attributes
– focuses on so-called descriptive data mining models, including clustering, summarization, association discovery and sequence discovery
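Profiling a single attribute can be sketched as follows. The function name, the returned keys and the phone-number sample are assumptions made for illustration. The pattern step generalises each value by replacing digits with 9 and letters with A.

```python
import re
from collections import Counter


def profile(values):
    """Collect simple quality statistics for one attribute."""
    present = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in present]
    # Generalise each value to a string pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in present
    )
    return {
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "unique": len(set(present)) == len(present),
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
        "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
    }


phones = ["306-555-0101", "306-555-0102", None, "3065550103", "306-555-0101"]
report = profile(phones)
```

Values that do not match the dominant pattern (here the number without dashes) are candidates for standardization.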
Data transformations
• Can be done via SQL operations
– which allows tracking of all transformations
– can include: extracting values from free-form attributes (attribute split), validation and correction, standardization, duplicate elimination
• May require considerable human involvement
– some transformations will be more complex than others
– some transformations will apply to more or less data
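A minimal sketch of SQL-based transformations, run here through Python's built-in sqlite3 module. The table names, columns and sample rows are hypothetical, and the address format (street, then a comma, then city) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_raw (name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO customer_raw VALUES (?, ?)",
    [
        ("  ann lee ", "12 Main St, Saskatoon"),
        ("Ann Lee", "12 Main St, Saskatoon"),
        ("bob ray", "9 Elm Ave, Regina"),
    ],
)

# Standardization (trim and upper-case the name), attribute split
# (address into street and city), then duplicate elimination (DISTINCT).
conn.execute(
    """
    CREATE TABLE customer_clean AS
    SELECT DISTINCT
        UPPER(TRIM(name)) AS name,
        TRIM(SUBSTR(address, 1, INSTR(address, ',') - 1)) AS street,
        TRIM(SUBSTR(address, INSTR(address, ',') + 1)) AS city
    FROM customer_raw
    """
)

rows = conn.execute(
    "SELECT name, street, city FROM customer_clean ORDER BY name"
).fetchall()
```

Keeping each step as an SQL statement means the sequence of statements itself serves as a record of what was done to the data.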