Data warehouses The multidimensional model · techniques to return an ... implementation techniques ... New indexing techniques have been developed for OLAP systems

Data warehouses Decision support The multidimensional model OLAP queries

Traditional DBMSs are used by organizations for maintaining data to record day to day operations

On-line Transaction Processing (OLTP)

Such DBs have driven the growth of the DBMS industry, and will continue to be important

Organizations are increasingly analyzing current and historical data to identify useful patterns

Market research to target products to particular market segments (including political polling)

Decision support for high level decision-making

On-line Analytic Processing (OLAP)

High level decision making requires an overall view of all aspects of an organization

This can be prohibitively expensive with a large, global, distributed database

Many organizations have created large consolidated data warehouses

Information from several databases is consolidated into a data warehouse

By copying tables from many sources into one location

The focus is on identifying patterns and global analysis, so having data current is not a priority

OperationalDatabases

extract clean

transform load

refresh

DataWarehouse

MetadataRepository

External DataSources

Extract data from operational databases and other sources

Clean data to minimize errors, and fill in missing information where possible

Transform data to reconcile semantic mismatches

Typically done by defining a view over the data sources

Load data by materializing, and storing, the views created in the transformation stage

Sort the data and generate summary information

Partition data and build indexes to increase efficiency

Semantic integration

Eliminate mismatches between data from many sources

▪ e.g. different schemas, currencies

Heterogeneous sources

Data has to be accessed from a variety of sources formats

Load, refresh, purge

Data must be loaded and periodically refreshed

Data that is too old should be purged

Metadata management

The source, loading date, and other information should be maintained for all data in the warehouse

Traditional SQL queries are inadequate for typical decision support queries

WHERE clauses often contain many AND, and OR conditions , SQL handles OR conditions poorly

Many statistical functions are not supported by SQL, so queries must be embedded in host language programs

Many queries involve conditions over time, or aggregation of data over time

Users often need to submit several related queries, SQL does not support optimization of such families of queries

Some systems support querying where the data is considered to be a multidimensional array

Typical queries involve group-by and aggregation operators, complex conditions, and statistical functions

OLAP applications

Some DBMSs are designed to support OLAP queries as well as traditional SQL queries

Relational DBMSs optimized for decision support

A third class of analysis tools supports exploratory data analysis

Identifying interesting patterns in the data, data mining

OLAP applications use ad hoc, complex queries involving group-by and aggregation operators

Typically OLAP queries are considered in terms of a multidimensional data model

The data can be represented as a multidimensional array

The focus is on a collection of numeric measures

Each measure depends on a set of dimensions

Each measure has one value, over each set of possible values for the dimensions

13 8 10 10

12 30 20 50

11 25 8 15

1 2 3

slice where location_id = 1

Measure of sales, dimensions are products, location and time

time_id

product_id

location_id

Multidimensional data can be stored physically in a multidimensional array, MOLAP

The array must be disk-resident and persistent

Or, data can be stored in relations, ROLAP The relation which relates the dimensions to a

particular measure is known as the fact table

Each dimension (location, product, etc.) can have additional attributes stored in a dimension table

Fact tables are usually much larger than dimension tables

Dimensions can be considered as hierarchies Attributes have positions in the hierarchies

In many OLAP applications it is necessary to store information about time Such data often cannot be represented by an SQL date or

timestamp▪ e.g. date, week, month, quarter, year, holiday status, …

week month

quarter

year

date

province

country

city

Star schemas are a common pattern in OLAP DBs

So called because the dimension tables form a star pattern around the fact table

The majority of the data is in the fact table

The fact table should have no redundancy and be in BCNF

To reduce size, system generated dimension identifiers are used as the primary keys of dimensions

▪ e.g. product ID instead of a product name

Dimension tables are often not normalized

As they are static, so update, insert, and delete anomalies are not an issue

Influenced by SQL and by spreadsheets A measure is commonly aggregated over one or more

dimensions e.g.

Find total sales

Find total sales for each city, or state

Find the top five products ranked by sales

Aggregation may be at different levels of a dimension hierarchy

Roll-up total sales by city to get sales by state, or

Drill-down from total sales by state, to get sales by city

▪ It is possible to drill-down on different dimensions

Pivoting sales on location and time produces the total sales for each location and time The result is a cross-

tabulation

Slicing a dataset is an equality selection on a dimension

Dicing a dataset is a range selection

BC Ont total

2001 63 81 144

2002 38 107 145

2003 75 35 110

total 176 223 399

cross-tabulation

The queries shown below give the cross-tabulation shown previously:

SELECT SUM(S.sales)FROM Sales S, Times T, Location LWHERE S.timeID = T. timeID AND S.llocationID = L. locationIDGROUP BY T.year, L.province

SELECT SUM(S.sales)FROM Sales S, Times TWHERE S. timeID = timeIDGROUP BY T.year

SELECT SUM(S.sales)FROM Sales S, Times TWHERE S.locationID = L. locationIDGROUP BY L.province

SELECT SUM(S.sales)FROM Sales S

total

details

sub-totals

The cross-tabulation shown previously is a roll-up on the location and time dimensions

With the sub-totals being roll-ups on time, and location

Each roll-up query corresponds to an SQL GROUP BY query with different grouping criteria

With three dimensions (time, location, and product) how many such queries are there?

If there are k dimensions, there are 2k possible GROUP BY queries generated by pivoting

These can be generated by an SQL 99 CUBE query

SQL 1999 extended the GROUP BY construct to provide better OLAP support

The GROUP BY clause with the CUBE keyword is equivalent to a collection of GROUP BY statements▪ With one GROUP BY statement for each subset of the k dimensions

The ROLLUP keyword can also be used, which distinguishes between the GROUP BY attributes▪ Subsets where the first attribute is null are not included in the result

(except where all attributes are null)

T.Year L.province SUM(s.sales)

2001 BC 63

2201 Ont 81

2001 null 144

2002 BC 38

2002 Ont 107

2002 null 145

2003 BC 75

2003 Ont 35

2003 null 110

null BC 176

null Ont 223

null null 399

SELECT t.year, L.state, SUM (S.sales)

FROM Sales S, Times T, Location L

WHERE S.timeID = T.timeID AND S.locationID = L.locationID

GROUP BY CUBE (T.year, L.province)

not included in the equivalent rollup query (GROUP BY ROLLUP)

The eight grouping lists for the set product, time and location are shown below

Note that the CUBE operation calculates the specified groupings and it's child nodes

{product, location, time}

{product, location} {product, time} {location, time}

{product} {location} {time}

{}

Trend analysis examples

Find the percentage change in the total monthly sales of each

product

Find the top five products ranked by total sales

Find the trailing n day moving average of sales

▪ For each day compute the average daily sales over the preceding n days

The first two queries are hard to express in SQL

The third is imposible if n is a parameter of the query

The SQL 1999 WINDOW clause allows such queries,

over a table viewed as a sequence

The FROM and WHERE clause is a normal SQL query

Call this query Temp

Temp is the sales rows with attributes of time and location

The subquery defines the window

SELECT L.province, T.month, AVG(S.sales) OVER W AS movavg

FROM Sales S, Time T, Location L

WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id

WINDOW W AS

(PARTITION BY L.province

ORDER BY T.month

RANGE BETWEEN INTERVAL '1' MONTH PRECEDING

AND INTERVAL '1' MONTH FOLLOWING)

Temp is partitioned by the PARTION BY clause

The result has one row for each row in the partition, rather than one row for each partition

The partitions are sorted by the ORDER BY clause The WINDOW clause makes groups nearby records

Value based, using RANGE (like the example)

Based on the number of rows, using the ROWS clause

Compute the aggregate function for each row, and its corresponding group, i.e. its window

There are new aggregate functions, RANK, and variants

In analyzing trends it is common to want to find the 10 (or n) best, or worst performers

e.g., which the top ten best selling products▪ Sort sales by each product

▪ Return answer in descending order

If there are a million products this is very wasteful

Instead guess (!) a sales value, c

All top ten performers are better than c,

But many more are less than c

Add the selection sales > c

The OPTIMIZE construct is supported by some DBMSs

The cutoff value (c) is chosen by the optimizer

Choosing the cutoff can be tricky

The effectiveness of the approach depends on how accurately

the cutoff can be estimated

SELECT P.productID, P.productName, S.sales

FROM Product P, Sales S

WHERE P. productID = S. productID AND

S.locationID = 1 AND S.timeID = 3

ORDER BY S.sales DESC

OPTIMIZE FOR 10 ROWS

This query may be expensive if the tables are large

If speed is of the essence it is possible to return data before the query is complete Either return the current running

average

Or use sampling and other statistical techniques to return an approximation

Note that the algorithms must be non-blocking

SELECT L.province, AVG(S.sales)

FROM Sales S, Location L

WHERE S.loc_id = L.loc_id

GROUP BY L.province

The OLAP environment motivates some different implementation techniques

Indexing is very important in OLAP systems

Interactive response time is desired for queries over very large database

OLAP systems are mostly read, and rarely updated which reduces the cost of maintaining indexes

New indexing techniques have been developed for OLAP systems

Bitmap indexes

Join indexes

A bitmap index can speed up queries on sparse columns, that have few possible values

One bit is allocated for each possible value

The indexes can be used to answer some queries

How many male customers have a rating of 3?

AND the M and 3 columns and count the 1s

M F id name sex rating 1 2 3 4 5

1 0 112 Sam M 5 0 0 0 0 1

0 1 113 Sue F 3 0 0 1 0 0

0 1 121 Ann F 2 0 1 0 0 0

1 0 131 Bob M 3 0 0 1 0 0

rating index

sex index

Joins are often expensive operations Join indexes can be built to speed up specific join

queries

A join index contains record IDs of matching records from different tables

e.g. Sales, products and locations of all sales in B.C.

The index would contain the sales record IDs and their matching product and location rids

▪ Only locations where province = "BC" are included

The number of such indexes can be a problem where there are many similar queries

To reduce the number of join indexes separate indexes can be created on selected columns

With record IDs of dimension table records that meet the condition, and record IDs of matching fact table records

The separate join indexes have to be combined, using record ID intersection, to compute a join query

The intersection can be performed more efficiently if the new indexes are bitmap indexes

Particularly if the selection columns are sparse

The result is a bitmapped join index

Decision support is a rapidly growing area of database use and research

It involves the creation of large, consolidated data repositories called data warehouses

Warehouses are queried using sophisticated analysis techniques

Complex multidimensional queries influenced by both SQL and spreadsheets

New techniques for database design, indexing, view maintenance and querying must be supported

Data mining consists of finding interesting trends in large datasets

Related to an area of statistics - exploratory data analysis

Such patterns should be identified with minimal user input

The knowledge discovery process is in four steps

Data selection - find the target subset of the data

Data cleaning - remove noise and outliers, transform fields to common units and prepare the data for analysis

Data mining - apply data mining algorithms to find interesting trends or patterns

Evaluation - present results to end users

A market basket is a collection of items purchased by a customer in a single transaction

Retailers commonly want to know which items are purchased together, to identify marketing opportunities

An itemset is a set of items bought in a transaction The support of an itemset is the fraction of transactions

that contain all the items in an itemset

e.g. if {milk, cookies} has 60% support then 60% of all transactions contain both milk and cookies

We may be interested in single item itemsets as they identify frequently purchased items

The a priori property is that every subset of a frequent

itemset is also a frequent itemset

The algorithm can proceed iteratively by first identifying frequent

itemsets with only one item

Each single item itemset can then be extended with another item to

generate larger candidate itemsets

Each iteration of the algorithm scans the transactions once

Increasing the candidate itemsets by one item

▪ The algorithm can be improved by only considering additional items that are

themselves itemsets

The minimum support level has to be specified by the user

If the number of {custpmerID, item} pairs is large the relation may have to be sorted or hashed, but

The result set is probably small, just the tip of the iceberg

The query will waste time computing all of the groups, even though only few will meet the HAVING condition

A modification of the priori property suggests that we only need to consider

Customers that have purchased 5 items, and

Items that have been purchased 5 times

SELECT P.customerID, P.item, SUM(P.quantity)FROM Purchases PGROUP BY P.customerID, P.itemHAVING SUM (P.qty) > 5

Find items that customers have purchased more than five times

An association rule is a rule of the form

{milk} {cookies}, which states that:

If milk is purchased in a transaction then it is likely that cookies are also purchased in that transaction

There are two measures for association rules

The support is the percentage of transactions that contain the {LHS RHS}

▪ The same as the support for that itemset

The confidence is a measure of the strength of the rule

▪ The percentage of times that cookies are purchased whenever milk is purchased, or

▪ support ({LHS RHS}) / support ({LHS})

Users ask for association rules with given minimum support and confidence

First all frequent itemsets with the specified minimum support are found▪ As discussed previously

Once the frequent itemsets have been produced they are divided into LHS and RHS

The confidence measure is then tested for each possible LHS and RHS combination of the qualifying itemsets

The most expensive part of the algorithm is identifying the frequent itemsets

Association rules can be applied to sets of days

By using the date field as a grouping attribute

In calendric market basket analysis the user specifies a group of dates, or calendar, to test the rule over

Sequential patterns can be analyzed, where a customer purchases a given sequence of itemsets

Care must be taken when using association rules for predictive purposes

Rules like {milk} {cookies} may not be causal

Finding causal relationships can be hard

Although two events (or purchases) are correlated there may not be a causal relationship between them

Each possible combination of causal relationships can be considered as a model of the world

Assign a score to each model based on its consistency with the observed data

Bayesian networks are graphs that can be used to describe such models▪ The number of models is exponential in the number of variables so

some subset should be considered

An insurance company may want to predict whether or not customers are high risk

What information can they use to do this?

▪ e.g. If a male aged between 16 and 25 drives a truck they are high risk

There is one attribute whose value is to be predicted, the dependent attribute, and

A number of predictor attributes

The general form of such rules is:

P1(X1) P2(X2) … Pk(Xk)Y = c

▪ Pis are predicates, Xis are predictors

The form of predicates depends on the type of the predictor attribute If the attribute is numerical, then numerical computations can

be performed▪ The predicate is of the form low Xi high

If the attribute is categorical, then we must test to see if two values are equal▪ The predicate is of the form Xi {v1, …, vj}

age is numerical, car type and risk rating are categorical▪ (16 age 25) car {truck} hiRisk = true

Rules are based on the dependent attribute type Categorical – classification rules

Numerical – regression rules

Support and confidence can be defined for classification and regression

Support – the support for a condition, C, is the percentage of records that satisfy C

▪ The support for C1 C2 is the support for C1 C2

Confidence – the confidence is the percentage of records of C1 that also satisfy condition C2

Classification and regression rules differ from association rules

They consider more than one set-valued field as the left hand side of the rule

A collection of classification rules can be represented as a decision tree Each internal node of the tree is labeled with a predictor

attribute, referred to as a splitting attribute

Outgoing edges are labeled with predicates

Leaf nodes are given labels values of dependent attributes Decision trees are constructed in two phases

In the growth phase an overly large tree is built▪ Trees are built by repeatedly splitting the tree on the best remaining

splitting criterion

▪ The database is then partitioned on this criterion

The tree is then pruned to remove overspecialized rules

The goal is to partition a set of records into groups

Records in a group are similar, and records that belong to different groups are dissimilar

Each group is called a cluster, and records should belong to only one cluster

Partitional clustering algorithms partitions the data into groups based on some criterion

Hierarchical clustering algorithms generate a sequence of partitions

In the first partition each cluster consists of one record, the algorithm then mergers two partitions in each step

Documents

Data warehouses The multidimensional model · techniques to return an ... implementation techniques ... New indexing techniques have been developed for OLAP systems