Training Program for Information and Informatics Personnel in Health Information Systems (Program Pelatihan Tenaga Informasi dan Informatika Sistem Informasi Kesehatan), Ari Cahyono

Data Warehousing Design

Learning Objectives

The issues associated with designing a data warehouse database

A technique for designing a data warehouse database called dimensionality modeling

How dimensionality modeling differs from Entity-Relationship (ER) modeling.

A step-by-step methodology for designing a data warehouse database.

Criteria for assessing the degree of dimensionality provided by a data warehouse.

Designing a Data Warehouse Database

Designing a data warehouse database is highly complex. It begins with answering questions such as:

▪ Which user requirements are most important, and which data should be considered first?

▪ Should the project be scaled down into something more manageable, yet at the same time provide an infrastructure capable of ultimately delivering a full-scale, enterprise-wide data warehouse?

A common solution: data marts.

Dimensionality Modeling

A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.

Dimensionality modeling uses the concepts of ER modeling with some important restrictions. Every dimensional model (DM) is composed of:

▪ Fact table: one table with a composite primary key.

▪ Dimension tables: each has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.
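The key structure described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the retail tables and column names (dim_date, dim_product, dim_store, fact_sales) are assumptions and do not come from the slides.

```python
import sqlite3

# Hypothetical retail example; all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: each component of the composite primary key
-- corresponds to the simple primary key of one dimension table.
CREATE TABLE fact_sales (
    date_key    INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    store_key   INTEGER NOT NULL REFERENCES dim_store(store_key),
    units_sold  INTEGER NOT NULL,
    PRIMARY KEY (date_key, product_key, store_key)
);
""")

# PRAGMA table_info returns one row per column; index 5 is the column's
# position within the primary key (0 means it is not part of the key).
fact_pk = [row[1] for row in conn.execute("PRAGMA table_info(fact_sales)") if row[5] > 0]
```

Here fact_pk lists the three dimension keys, which together identify a fact row, while units_sold is the only non-key column.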

Star Schema

The star schema exploits the characteristics of factual data: facts are generated by events that occurred in the past and are unlikely to change, regardless of how they are analyzed.

Also known as a star join.

“A logical structure that has a fact table containing factual data in the centre, surrounded by dimension tables containing reference data (which can be denormalized).”
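A star join can be sketched as a query that joins the central fact table to each surrounding dimension table and groups by dimension attributes. The tables, columns, and data below are made up for illustration and use Python's built-in sqlite3 module.

```python
import sqlite3

# Hypothetical tables and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (store_key INTEGER, product_key INTEGER, units_sold INTEGER,
                          PRIMARY KEY (store_key, product_key));
INSERT INTO dim_store   VALUES (1, 'Yogyakarta'), (2, 'Semarang');
INSERT INTO dim_product VALUES (10, 'Food'), (11, 'Drinks');
INSERT INTO fact_sales  VALUES (1, 10, 5), (1, 11, 3), (2, 10, 4);
""")

# Star join: the fact table sits in the centre and is joined to each
# dimension; the dimension attributes supply the grouping context.
star_join = """
SELECT s.city, p.category, SUM(f.units_sold)
FROM fact_sales AS f
JOIN dim_store   AS s ON s.store_key   = f.store_key
JOIN dim_product AS p ON p.product_key = f.product_key
GROUP BY s.city, p.category
ORDER BY s.city, p.category
"""
units_by_city_category = conn.execute(star_join).fetchall()
```

Every join in the query goes directly from the fact table to a dimension table, which is what gives the schema its star shape.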

Snowflake schema

A variant of the star schema where dimension tables do not contain denormalized data
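As a rough sketch of the difference, the function below takes a denormalized (star-style) product dimension and splits the repeated category attributes into their own table, which is the kind of normalization a snowflake schema applies. The attribute names and sample rows are assumptions made for the example.

```python
# Denormalized (star) product dimension: category attributes repeat per product.
# All names and values here are hypothetical.
star_product_dim = [
    {"product_key": 1, "product_name": "Rice 5 kg", "category_name": "Staples", "category_manager": "Dewi"},
    {"product_key": 2, "product_name": "Cooking oil 1 L", "category_name": "Staples", "category_manager": "Dewi"},
    {"product_key": 3, "product_name": "Notebook", "category_name": "Stationery", "category_manager": "Budi"},
]

def snowflake(rows):
    """Move category attributes into a separate table keyed by category_key."""
    category_keys = {}  # (category_name, category_manager) -> surrogate key
    product_dim = []
    for row in rows:
        category = (row["category_name"], row["category_manager"])
        # Assign the next surrogate key the first time a category is seen.
        key = category_keys.setdefault(category, len(category_keys) + 1)
        product_dim.append({
            "product_key": row["product_key"],
            "product_name": row["product_name"],
            "category_key": key,
        })
    category_dim = [
        {"category_key": key, "category_name": name, "category_manager": manager}
        for (name, manager), key in category_keys.items()
    ]
    return product_dim, category_dim

product_dim, category_dim = snowflake(star_product_dim)
```

The snowflaked form stores each category once, at the cost of an extra join whenever a query needs category attributes.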

Starflake schema

A hybrid structure that contains a mixture of star and snowflake schemas

Nine-Step Methodology by Kimball (1996)

1. Choosing the process

2. Choosing the grain

3. Identifying and conforming the dimensions

4. Choosing the facts

5. Storing pre-calculations in the fact table

6. Rounding out the dimension tables

7. Choosing the duration of the database

8. Tracking slowly changing dimensions

9. Deciding the query priorities and the query modes

Comparison of DM and ER Model

ER modeling is a technique for identifying relationships among entities.

Goal: to remove redundancy in the data.

Inefficient for ad hoc end-user queries.

Traditional ER modeling does not support the main attraction of data warehousing, namely intuitive and high-performance retrieval of data.

A single ER model normally decomposes into multiple DMs.

The multiple DMs are then associated through ‘shared’ dimension tables.

Step 1: Choosing the process

The process (function) refers to the subject of a particular data mart.

Choose the main entities and relationships.

Step 2: Choosing the grain

Deciding exactly what a fact table record represents, e.g. in ProductSales each record represents an individual product sale.

Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.
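One way to make a chosen grain concrete is to declare it as a set of columns and check that no two fact rows share the same combination. The grain below (one row per product per sales transaction) and the column names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical grain: one fact row per product per sales transaction.
GRAIN = ("transaction_id", "product_key")

def grain_violations(rows, grain=GRAIN):
    """Return grain-key combinations that appear in more than one fact row."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]

fact_rows = [
    {"transaction_id": 1001, "product_key": 10, "units_sold": 2},
    {"transaction_id": 1001, "product_key": 11, "units_sold": 1},
    {"transaction_id": 1002, "product_key": 10, "units_sold": 4},
]
```

With this grain, the two columns in GRAIN point to the candidate dimensions (the transaction and the product), which is why the grain has to be fixed before the dimensions are identified.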

Step 3: Identifying and conforming the dimensions

Dimensions set the context for asking questions about the facts in the table.

A well-built set of dimensions makes the data mart understandable and easy to use.

Identify dimensions in sufficient detail to describe things.

A poorly presented or incomplete set of dimensions will reduce the usefulness of a data mart to an enterprise.

If any dimension occurs in two data marts, it must be exactly the same dimension in both, or one must be a mathematical subset of the other.
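The conformance rule can be sketched as a simple check. The sketch assumes each dimension is represented as a dict from dimension key to attribute values, and it reads "subset" as a subset of rows; both are assumptions of this example, not something the slides specify.

```python
def is_conformed(dim_a, dim_b):
    """True if the two dimensions are identical or one is a subset of the other.

    Each dimension is a dict mapping a dimension key to its attribute dict
    (a hypothetical in-memory representation used only for this sketch).
    """
    shared = dim_a.keys() & dim_b.keys()
    # Rows present in both data marts must carry exactly the same attributes.
    if any(dim_a[key] != dim_b[key] for key in shared):
        return False
    # One key set must contain the other (equal key sets pass both tests).
    return dim_a.keys() <= dim_b.keys() or dim_b.keys() <= dim_a.keys()

sales_products = {
    10: {"name": "Rice 5 kg", "category": "Staples"},
    11: {"name": "Notebook", "category": "Stationery"},
}
inventory_products = {10: {"name": "Rice 5 kg", "category": "Staples"}}
```

Here is_conformed(sales_products, inventory_products) holds because the inventory mart uses a subset of the same product rows with identical attributes.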

Step 4: Choosing the facts

The grain of the fact table determines which facts can be used in the data mart.

All the facts must be expressed at the level implied by the grain.

Additional facts can be added to a fact table at any time provided they are consistent with the grain of the table.

Step 5: Storing pre-calculations in the fact table

Add valuable derived information that can be calculated from the other facts.
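As a minimal sketch, the function below stores derived facts next to the base facts when a fact row is loaded. The fact names (units_sold, unit_price, unit_cost, sales_amount, cost_amount, gross_profit) are hypothetical.

```python
def add_precalculations(fact_row):
    """Return a copy of the row with derived facts stored beside the base facts."""
    row = dict(fact_row)
    row["sales_amount"] = row["units_sold"] * row["unit_price"]
    row["cost_amount"] = row["units_sold"] * row["unit_cost"]
    row["gross_profit"] = row["sales_amount"] - row["cost_amount"]
    return row

# Hypothetical fact row; amounts are whole currency units.
loaded_row = add_precalculations({"units_sold": 3, "unit_price": 2000, "unit_cost": 1500})
```

Storing the derived values means every query reads the same numbers instead of recomputing them, possibly inconsistently, at query time.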

Step 6: Rounding out the dimension tables

Add as many text descriptions to the dimensions as possible.

The text descriptions should be as intuitive and understandable to users as possible.

The usefulness of a data mart is determined by the scope and nature of the attributes of the dimension table.

Step 7: Choosing the duration of the database

The duration measures how far back in time the fact table goes.

Step 8: Tracking slowly changing dimensions

Three types of slowly changing dimensions (SCD):

1. A changed dimension attribute is overwritten.

2. A changed dimension attribute causes a new dimension record to be created.

3. A changed dimension attribute causes an alternate attribute to be created, so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.
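The three ways of handling a change can be sketched on a small customer dimension. The column names (customer_key as surrogate key, customer_id as natural key, city) and the previous_ prefix for the alternate attribute are assumptions made for this example.

```python
import copy

def scd_type1(dim_rows, customer_id, attr, new_value):
    """Type 1: overwrite the attribute; the old value is lost."""
    rows = copy.deepcopy(dim_rows)
    for row in rows:
        if row["customer_id"] == customer_id:
            row[attr] = new_value
    return rows

def scd_type2(dim_rows, customer_id, attr, new_value):
    """Type 2: create a new dimension record with a new surrogate key."""
    rows = copy.deepcopy(dim_rows)
    # The most recently added record for this customer is treated as current.
    current = [r for r in rows if r["customer_id"] == customer_id][-1]
    new_row = dict(current)
    new_row["customer_key"] = max(r["customer_key"] for r in rows) + 1
    new_row[attr] = new_value
    rows.append(new_row)
    return rows

def scd_type3(dim_rows, customer_id, attr, new_value):
    """Type 3: keep old and new values side by side in the same record."""
    rows = copy.deepcopy(dim_rows)
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_" + attr] = row[attr]  # alternate attribute
            row[attr] = new_value
    return rows

# Hypothetical dimension with one customer who moves to another city.
customers = [{"customer_key": 1, "customer_id": "C-17", "city": "Yogyakarta"}]
```

Type 2 is the only variant here that preserves full history, because older fact rows keep pointing at the old surrogate key.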

Step 9: Deciding the query priorities and the query modes

Consider physical design issues:

▪ Physical sort order of the fact table on disk and the presence of pre-stored summaries or aggregations.

▪ Administration, backup, indexing performance, and security.

Inside Fact Tables

Fact tables are where we keep the measurements.

We may keep the details at the lowest possible level.

▪ In the department store fact table for sales analysis, we may keep the units sold by individual transactions at the cashier’s checkout.

▪ Some fact tables may contain only summary data; these are called aggregate fact tables.
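The relationship between a detailed fact table and an aggregate fact table can be sketched as a summarization step. The row layout (date, store_key, product_key, units_sold) is an assumption for this example.

```python
from collections import defaultdict

def build_daily_aggregate(transaction_facts):
    """Summarize transaction-level facts to one row per (date, store, product)."""
    totals = defaultdict(int)
    for row in transaction_facts:
        totals[(row["date"], row["store_key"], row["product_key"])] += row["units_sold"]
    return [
        {"date": date, "store_key": store, "product_key": product, "units_sold": units}
        for (date, store, product), units in sorted(totals.items())
    ]

# Hypothetical checkout transactions at the lowest level of detail.
transaction_facts = [
    {"transaction_id": 1, "date": "2024-01-05", "store_key": 1, "product_key": 10, "units_sold": 2},
    {"transaction_id": 2, "date": "2024-01-05", "store_key": 1, "product_key": 10, "units_sold": 3},
    {"transaction_id": 3, "date": "2024-01-05", "store_key": 1, "product_key": 11, "units_sold": 1},
]
daily_aggregate = build_daily_aggregate(transaction_facts)
```

The aggregate can always be rebuilt from the detail rows, but the individual transactions cannot be recovered from the aggregate.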

Fact Tables’ Characteristics

▪ Concatenated fact table key

▪ Grain or level of data identified: the data grain is the level of detail for the measurements or metrics

▪ Fully additive measures

▪ Semi-additive measures

▪ Large number of records

▪ Table deep, not wide: only a few attributes

▪ Sparsity of data

▪ Degenerate dimensions: a degenerate dimension doesn’t have a dimension key
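The difference between fully additive and semi-additive measures can be illustrated with a stock-level snapshot, a common example of a measure that can be summed across products but not across dates. The data and names below are hypothetical.

```python
# Hypothetical daily stock snapshot: one row per (date, product).
inventory_facts = [
    {"date": "2024-01-01", "product_key": 10, "units_on_hand": 40},
    {"date": "2024-01-01", "product_key": 11, "units_on_hand": 25},
    {"date": "2024-01-02", "product_key": 10, "units_on_hand": 30},
    {"date": "2024-01-02", "product_key": 11, "units_on_hand": 35},
]

def total_on_hand(rows, date):
    """Summing across products on a single date is meaningful."""
    return sum(r["units_on_hand"] for r in rows if r["date"] == date)

def average_on_hand(rows, product_key):
    """Across dates, average the snapshots instead of summing them."""
    values = [r["units_on_hand"] for r in rows if r["product_key"] == product_key]
    return sum(values) / len(values)
```

Adding the two daily snapshots for product 10 would give 70 units, which was never in stock at any point; a fully additive measure such as units sold has no such restriction.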

Degenerate Dimensions

Look closely at the attributes order_number and order_line. These are not measures, metrics, or facts.

They are attributes that are neither facts nor strictly dimension attributes, e.g. reference numbers such as order numbers, invoice numbers, and order line numbers.

Example usage: finding the average number of products per order.
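The example usage can be sketched by grouping fact rows on the degenerate dimension order_number. Apart from order_number and order_line, the column names and data are assumptions for illustration.

```python
def average_products_per_order(fact_rows):
    """Group order-line fact rows by order_number and average the product counts."""
    products_by_order = {}
    for row in fact_rows:
        products_by_order.setdefault(row["order_number"], set()).add(row["product_key"])
    return sum(len(products) for products in products_by_order.values()) / len(products_by_order)

# Hypothetical order-line facts; order_number has no dimension table of its own.
order_line_facts = [
    {"order_number": "SO-1", "order_line": 1, "product_key": 10, "quantity": 2},
    {"order_number": "SO-1", "order_line": 2, "product_key": 11, "quantity": 1},
    {"order_number": "SO-1", "order_line": 3, "product_key": 12, "quantity": 5},
    {"order_number": "SO-2", "order_line": 1, "product_key": 10, "quantity": 1},
]
```

The order number is useful purely as a grouping value in the fact table; there are no further order attributes to look up.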

Factless Fact Tables

Fact tables that do not need to contain any facts are called “factless” fact tables.

e.g. analyzing student attendance:
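A minimal sketch of the attendance example: each row holds only dimension keys, and analysis consists of counting rows. The key names (date_key, student_key, course_key) and the data are assumptions.

```python
from collections import Counter

# Each row records that an attendance event happened; there is no numeric fact.
# Key names and values are hypothetical.
attendance_facts = [
    {"date_key": 20240105, "student_key": 1, "course_key": 7},
    {"date_key": 20240105, "student_key": 2, "course_key": 7},
    {"date_key": 20240112, "student_key": 1, "course_key": 7},
]

def attendance_by(rows, dimension_key):
    """Count attendance events per value of one dimension key."""
    return Counter(row[dimension_key] for row in rows)
```

For instance, attendance_by(attendance_facts, "student_key") counts how many classes each student attended, even though no measurement column exists.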

Moving a rapidly changing dimension attribute to the fact table as a degenerate dimension column

Date Dimensions

Aggregate Fact Tables

Aggregating Fact Tables

Dimension Hierarchies

Forming Aggregate Fact Tables

Hierarchies of the store, customer, and product dimensions
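How a dimension hierarchy drives aggregation can be sketched by rolling facts up to a chosen level of the product hierarchy. The hierarchy levels (product, category, department) and all data are assumed for illustration, since the original figure is not available.

```python
from collections import defaultdict

# Hypothetical product dimension carrying its hierarchy levels as attributes.
product_dim = {
    10: {"product": "Rice 5 kg", "category": "Staples", "department": "Grocery"},
    11: {"product": "Cooking oil 1 L", "category": "Staples", "department": "Grocery"},
    12: {"product": "Notebook", "category": "Stationery", "department": "General"},
}
sales_facts = [
    {"product_key": 10, "units_sold": 5},
    {"product_key": 11, "units_sold": 3},
    {"product_key": 12, "units_sold": 7},
]

def roll_up(facts, dimension, level):
    """Total units_sold at the given hierarchy level of the dimension."""
    totals = defaultdict(int)
    for row in facts:
        totals[dimension[row["product_key"]][level]] += row["units_sold"]
    return dict(totals)
```

Each level of the hierarchy is a candidate grain for an aggregate fact table, and the same approach applies to the store and customer hierarchies.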

Example: Inpatient Service
