By Mahesh R. Sanghavi, Associate Professor,
SNJB's KBJ CoE, Chandwad
• All the content of these PPTs was taken from the PPTs of renowned authors via the internet.
• These PPTs are only meant to share knowledge among the students.
• I specially thank the authors and owners due to whom it was possible to prepare these PPTs.
Introduction: Data Warehouse Modeling
Data Warehouse Design
Data Warehouse Technology
Distributed Data Warehouse
Materialized view
February 3, 2016 Data Mining: Concepts and Techniques 4
• A data warehouse is the foremost repository for the data available for developing business intelligence architectures and decision support systems.
• The term data warehousing indicates the whole set of interrelated activities involved in designing, implementing, and using a data warehouse.
• A data warehouse is a relational database for query and analysis.
It contains historical data
It uses an OLAP (online analytical processing) engine
Its key features are as follows:
• subject-oriented,
• integrated,
• time-variant,
• nonvolatile
• Three main categories of data feeding into a data warehouse
• Internal Data
– Back-office, front-office, web-based
• External Data
– Collected from market surveys, GIS, etc.
• Personal Data
– Produced by the BI analyst
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
                   | OLTP                                       | OLAP
users              | clerk, IT professional                     | knowledge worker
function           | day-to-day operations                      | decision support
DB design          | application-oriented                       | subject-oriented
data               | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage              | repetitive                                 | ad-hoc
access             | read/write; index/hash on primary key      | lots of scans
unit of work       | short, simple transaction                  | complex query
# records accessed | tens                                       | millions
# users            | thousands                                  | hundreds
DB size            | 100MB-GB                                   | 100GB-TB
metric             | transaction throughput                     | query throughput, response time
• Integration
• Quality
• Efficiency
• Extendibility
• Entity-oriented
• Integrated
• Time-variant
• Persistent
• Consolidated
• Denormalized
Data warehouse                                | Operational system
Subject-oriented                              | Transaction-oriented
Large (hundreds of GB up to several TB)       | Small (MB up to several GB)
Historic data                                 | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates                                 | Continuous updates
Usually very complex queries                  | Simple to complex queries
• Data marts are systems that gather all the data required by a specific company department, such as marketing or logistics, for the purpose of performing business intelligence analyses and executing decision support applications specific to the function itself.
• Therefore, a data mart can be considered a functional or departmental data warehouse of a smaller size and a more specific type than the overall company data warehouse.
Contd..
• Data warehouses and data marts thus share the same technological framework.
• In order to implement business intelligence applications, some companies prefer to design and develop, in an incremental way, a series of integrated data marts rather than a central data warehouse, in order to reduce the implementation time and uncertainties connected with the project.
Data Warehouse           | Data Mart
Enterprise-wide data     | Department-wide data
multiple subject areas   | single subject area
difficult to build       | easy to build
takes more time to build | takes less time to build
larger memory            | limited memory
• Accuracy • Completeness • Consistency • Timeliness • Non-redundancy • Relevance • Interoperability • Accessibility
• The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
Cross-tabs can be represented as relations
The value all is used to represent aggregates
The SQL:1999 standard actually uses null values in place of all, despite the potential confusion with regular null values
A data cube is a multidimensional generalization of a cross-tab
Can have n dimensions; we show 3 below
Cross-tabs can be used as views on a data cube
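As a small sketch (in Python, with made-up sales data; the relation and helper name are illustrative, not from the slides), a cross-tab represented as a relation can be computed by having every detail row also contribute to the aggregate cells that use the special value all:

```python
from itertools import product

# Hypothetical sales relation: (item, color, quantity)
sales = [
    ("skirt", "dark", 8), ("skirt", "pastel", 35),
    ("dress", "dark", 20), ("dress", "pastel", 10),
]

def cross_tab(rows):
    """Aggregate quantity for every (item, color) pair, using the
    special value 'all' to represent aggregates over a dimension."""
    totals = {}
    for item, color, qty in rows:
        # Each row contributes to four cells: the detail cell plus
        # the aggregate cells that replace a key with 'all'.
        for i, c in product((item, "all"), (color, "all")):
            totals[(i, c)] = totals.get((i, c), 0) + qty
    return totals

tab = cross_tab(sales)
print(tab[("skirt", "all")])   # row total for skirt: 43
print(tab[("all", "all")])     # grand total: 73
```

Replacing "all" with None in the keys would mimic the SQL:1999 convention of using null in place of all.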
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables; viewed as a collection of stars, therefore called galaxy schema or fact constellation
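A minimal star-schema sketch in SQLite (table and column names follow the sales example in these slides; the data values are made up) shows the fact table joining to each dimension through its surrogate key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item,
    location_key INTEGER REFERENCES location,
    units_sold INTEGER, dollars_sold REAL);
""")
con.executemany("INSERT INTO time_dim VALUES (?,?,?)",
                [(1, 2016, "1Qtr"), (2, 2016, "2Qtr")])
con.executemany("INSERT INTO item VALUES (?,?,?)",
                [(1, "TV", "BrandA"), (2, "PC", "BrandB")])
con.executemany("INSERT INTO location VALUES (?,?,?)",
                [(1, "Vancouver", "Canada"), (2, "Chicago", "USA")])
con.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                [(1, 1, 1, 10, 4000.0), (1, 2, 2, 5, 6000.0),
                 (2, 1, 1, 7, 2800.0)])

# A typical subject-oriented query: one join per dimension needed.
rows = con.execute("""
    SELECT i.item_name, SUM(f.units_sold)
    FROM sales_fact f JOIN item i ON f.item_key = i.item_key
    GROUP BY i.item_name ORDER BY i.item_name
""").fetchall()
print(rows)  # [('PC', 5), ('TV', 17)]
```

In a snowflake variant, item would instead carry a supplier_key referencing a separate supplier table, adding one more join to the same query.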
[Figure: Star schema example — Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales) connected to dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city, state_or_province, country).]
[Figure: Snowflake schema example — the same Sales Fact Table, but with normalized dimensions: item (item_key, item_name, brand, type, supplier_key) references supplier (supplier_key, supplier_type), and location (location_key, street, city_key) references city (city_key, city, state_or_province, country); time and branch as in the star schema.]
[Figure: Fact constellation example — the Sales Fact Table and a Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped) share the time, item, and location dimension tables; the shipper dimension (shipper_key, shipper_name, location_key, shipper_type) belongs to the shipping fact table.]
[Figure: A concept hierarchy for the dimension location — all > region (Europe, North_America) > country (Germany, Spain, Canada, Mexico) > city (Frankfurt, Vancouver, Toronto, ...) > office (L. Chan, M. Wind, ...).]
Cross-tabs can be easily extended to deal with hierarchies
Multiple levels of a hierarchy can be displayed in the cross-tab
Can drill down or roll up on a hierarchy
Cross-tab without hierarchy
Figure 18.5
Figure 18.1
[Figure: A 3-D data cube with dimensions item (TV, VCR, PC), time (1Qtr-4Qtr), and location (U.S.A., Canada, Mexico); sum cells aggregate along each dimension, e.g., the cell for total annual sales of TV in U.S.A.]
Roll-up (drill-up):
• summarize data; perform aggregation on the data cube
• by climbing up a hierarchy or by dimension reduction
Drill-down (roll-down):
• reverse of roll-up
• from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice:
• selection on one dimension of the cube, resulting in a sub-cube
Dice:
• defines a sub-cube by selecting on two or more dimensions
Pivot (rotate):
• a visualization operation: reorient the cube; 3-D to a series of 2-D planes
Drill-across:
• involves (across) more than one fact table
Drill-through:
• through the bottom level of the cube to its back-end relational tables (using SQL)
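The first few operations can be sketched on a toy in-memory cube (a dict keyed by dimension values; data and helper names are made up for illustration):

```python
# A toy cube keyed by (city, item, quarter) -> units_sold.
cube = {
    ("Vancouver", "TV", "1Qtr"): 10, ("Vancouver", "TV", "2Qtr"): 7,
    ("Vancouver", "PC", "1Qtr"): 4,  ("Chicago", "TV", "1Qtr"): 6,
    ("Chicago", "PC", "2Qtr"): 9,
}

def roll_up(cube, drop_axis):
    """Roll-up by dimension reduction: aggregate one axis away."""
    out = {}
    for key, v in cube.items():
        k = tuple(x for i, x in enumerate(key) if i != drop_axis)
        out[k] = out.get(k, 0) + v
    return out

def slice_(cube, axis, value):
    """Slice: fix one dimension to a single value -> a sub-cube."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, selections):
    """Dice: restrict two or more dimensions; selections maps an
    axis index to its allowed set of values."""
    return {k: v for k, v in cube.items()
            if all(k[a] in vals for a, vals in selections.items())}

# Roll up over quarter: totals per (city, item).
print(roll_up(cube, 2)[("Vancouver", "TV")])        # 17
# Slice on city = "Chicago".
print(slice_(cube, 0, "Chicago"))
# Dice on item in {TV} and quarter in {1Qtr}.
print(dice(cube, {1: {"TV"}, 2: {"1Qtr"}}))
```

Drill-down is simply the inverse mapping (back to the more detailed cube), and pivot only changes how the same cells are displayed, so neither changes the stored data.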
Four views regarding the design of a data warehouse:
Top-down view
– allows selection of the relevant information necessary for the data warehouse for future business needs
Data source view
– exposes the information being captured, stored, and managed by operational systems
Data warehouse view
– consists of fact tables and dimension tables
Business query view
– sees the perspectives of data in the warehouse from the view of the end user
1. Choose a business process to model, e.g., orders, invoices, etc.
2. Choose the grain of the business process. The grain is the fundamental level of data to be represented in the fact table, e.g., an individual transaction or an individual daily snapshot.
3. Choose the dimensions that will apply to each fact table record, e.g., item, year, supplier.
4. Choose the measures that will populate each fact table record, e.g., dollars_sold, units_sold.
• Top-down: Starts with overall design and planning (mature)
• Bottom-up: Starts with experiments and prototypes (rapid)
Top-down, bottom-up approaches, or a combination of both
• Waterfall: structured and systematic analysis at each step before proceeding to the next
• Spiral: rapid generation of increasingly functional systems; short turnaround time
From a software engineering point of view
• Two approaches for Centralized Data Warehouse – Inmon’s (Top-Down)
– Kimball’s (Bottom-Up)
In a nutshell
Bill Inmon's enterprise data warehouse approach (the top-down design): A normalized data model is designed first. Then the dimensional data marts, which contain data required for specific business processes or specific departments, are created from the data warehouse.
Ralph Kimball's dimensional design approach (the bottom-up design): The data marts facilitating reports and analysis are created first; these are then combined together to create a broad data warehouse.
Inmon defines the data warehouse in the following terms:
1. Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
2. Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.
3. Non-volatile: Data in the data warehouse is never over-written or deleted; once committed, the data is static, read-only, and retained for future reporting.
4. Integrated: The database contains data from most or all of an organization's operational applications, and this data is made consistent.
• Data marts are created first.
• These provide a thin view into the organizational data, and as and when required these can be combined into a larger data warehouse.
• Kimball defines the data warehouse as "a copy of transaction data specifically structured for query and analysis."
• Kimball’s data warehousing architecture is also known as Data Warehouse Bus (BUS). Dimensional modeling focuses on ease of end user accessibility and provides a high level of performance to the data warehouse.
[Figure: Three-tier data warehouse architecture — Bottom tier: the data warehouse database server, fed from operational DBs and other sources via extract, clean, transform, load, and refresh, coordinated by a monitor & integrator with metadata; data marts sit alongside the warehouse. Middle tier: OLAP server. Top tier (front end): analysis, query, reports, and data mining tools producing the output.]
Data extraction
• get data from multiple, heterogeneous, and external sources
Data cleaning
• detect errors in the data and rectify them when possible
Data transformation
• convert data from legacy or host format to warehouse format
Load
• sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
• propagate the updates from the data sources to the warehouse
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
• schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
Operational metadata
• data lineage (history of migrated data and transformation paths), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from the operational environment to the data warehouse
Data related to system performance
• includes indices and profiles that improve data access
• rules for the timing and scheduling of refresh, update, and replication
Business metadata
• business terms and definitions, ownership of data, charging policies
Bottom tier – warehouse database server
Middle tier – OLAP server (ROLAP or MOLAP)
Top tier – front-end client layer; contains reporting tools, analysis tools, or data mining tools
Enterprise Data Warehouse (EDW)
• collects all of the information about subjects spanning the entire organization.
• It contains detailed data as well as summarized data and can range in size from gigabytes to terabytes
• Contains detailed data and requires years to design and build
Data Mart (DM)
• a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects; for example, a marketing data mart may confine its subjects to year, item, and sales.
• It is in summarized form and requires a few weeks for implementation
• Data mart can be independent or dependent
Virtual Data Warehouse (VDW)
• A set of views over operational databases
• For efficient query processing , only some of the possible summary views may be materialized
• Easy to build
• Requires excess capacity on operational database servers
Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
• Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
• Greater scalability
Multidimensional OLAP (MOLAP)
• Sparse array-based multidimensional storage engine
• Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
• Flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers (e.g., Redbricks)
• Specialized support for SQL queries over star/snowflake schemas
[Figure: A recommended development approach — define a high-level corporate data model; through successive model refinement, build data marts and distributed data marts, leading to a multi-tier data warehouse / enterprise data warehouse.]
Information processing
• supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts, and graphs
Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations: slice-and-dice, drilling, pivoting
Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
• ETL
–Extraction
–Transformation
–Loading
To get data out of the source and load it into the data warehouse – simply a process of copying data from one database to another.
Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.
Data is extracted from heterogeneous data sources
Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data.
ETL process needs to effectively integrate systems that have different:
• DBMS
• Operating Systems
• Hardware
• Communication protocols
Need to have a logical data map before the physical data can be transformed.
The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, usually presented in a table or spreadsheet.
• The table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts.
• The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process
• The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement
[Logical data map columns: Target (Table Name, Column Name, Data Type) | Source (Table Name, Column Name, Data Type) | Transformation]
Actually changes data and provides guidance on whether data can be used for its intended purposes
Performed in staging area
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places: after extraction, and after cleaning and conforming; additional checks are run at this point
Column Property Enforcement
• Null values in required columns
• Numeric values that fall outside of expected highs and lows
• Columns whose lengths are exceptionally short/long
• Columns with certain values outside of discrete valid value sets
• Adherence to a required pattern / membership in a set of patterns
• Spell check
• Outlier check
Structure Enforcement
• Tables have proper primary and foreign keys
• Obey referential integrity
Data and Rule Value Enforcement
• Simple business rules
• Logical data checks
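A few of the column-property checks above can be sketched as follows (the staged rows, column names, valid-value set, and pattern are all hypothetical):

```python
import re

rows = [
    {"cust_id": "C001", "region": "Asia",   "age": 34},
    {"cust_id": None,   "region": "Europe", "age": 29},   # null key
    {"cust_id": "C003", "region": "Mars",   "age": 210},  # bad set + range
]

VALID_REGIONS = {"Asia", "Europe", "America"}   # discrete valid value set
ID_PATTERN = re.compile(r"^C\d{3}$")            # required pattern

def check_row(row):
    """Return a list of quality-rule violations for one staged row."""
    errors = []
    if row["cust_id"] is None:
        errors.append("null value in required column cust_id")
    elif not ID_PATTERN.match(row["cust_id"]):
        errors.append("cust_id does not match required pattern")
    if row["region"] not in VALID_REGIONS:
        errors.append("region outside discrete valid value set")
    if not (0 <= row["age"] <= 120):
        errors.append("age outside expected high/low range")
    return errors

bad = {i: check_row(r) for i, r in enumerate(rows) if check_row(r)}
print(bad)  # rows 1 and 2 fail
```

Structure enforcement (primary/foreign keys, referential integrity) would be checked across tables rather than per row, typically by anti-joins against the dimension keys.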
[Figure: Cleaning-and-conforming flow — staged data passes through cleaning and conforming; if fatal errors are found, stop; otherwise proceed to loading.]
Physically built to have the minimal sets of components
The primary key is a single field containing meaningless unique integer – Surrogate Keys
The DW owns these keys and never allows any other entity to assign them
De-normalized flat tables – all attributes in a dimension must take on a single value in the presence of a dimension primary key.
Should possess one or more other fields that compose the natural key of the dimension
The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occur in this module.
Surrogate keys are keys that have no "business" meaning and are solely used to identify a record in the table. Such keys are typically database-generated (for example, Identity in SQL Server, Sequence in Oracle, Sequence/Identity in DB2 UDB) or assigned by the ETL process.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.
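A minimal sketch of surrogate-key assignment during a dimension load (the class and field names are illustrative; this shows the simplest SCD Type 1 overwrite, not the full range of slowly changing dimension techniques):

```python
class DimensionLoader:
    """The warehouse, not the source system, owns the surrogate keys;
    the natural key is kept as a separate lookup field."""

    def __init__(self):
        self.next_key = 1          # warehouse-assigned integer keys
        self.by_natural_key = {}   # natural key -> surrogate key
        self.rows = {}             # surrogate key -> attributes

    def load(self, natural_key, attributes):
        """Insert or overwrite a dimension row (SCD Type 1)."""
        sk = self.by_natural_key.get(natural_key)
        if sk is None:
            sk = self.next_key     # meaningless unique integer
            self.next_key += 1
            self.by_natural_key[natural_key] = sk
        self.rows[sk] = dict(attributes, natural_key=natural_key)
        return sk

dim = DimensionLoader()
k1 = dim.load("SKU-17", {"item_name": "TV", "brand": "BrandA"})
k2 = dim.load("SKU-99", {"item_name": "PC", "brand": "BrandB"})
k3 = dim.load("SKU-17", {"item_name": "TV", "brand": "BrandC"})  # update
print(k1, k2, k3)  # 1 2 1: the same natural key keeps its surrogate key
```

An SCD Type 2 loader would instead assign a new surrogate key on change and keep the old row with an end date, so that facts keep pointing at the historically correct dimension version.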
• Efficient Data Cube Computation
• Cube operation
• Iceberg
• Indexing
• Data cube can be viewed as a lattice of cuboids
– The bottom-most cuboid is the base cuboid
– The top-most cuboid (apex) contains only one cell
– How many cuboids in an n-dimensional cube with L levels?
• Materialization of data cube
– Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
– Selection of which cuboids to materialize, based on size, sharing, access frequency, etc.
T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1), i.e., the product over i = 1..n of (Li + 1), where Li is the number of hierarchy levels of dimension i.
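The cuboid count is easy to evaluate directly (a small helper, using the formula above; the example level counts are illustrative):

```python
from math import prod

def num_cuboids(levels):
    """Number of cuboids in an n-dimensional cube, where levels[i] is
    the number of hierarchy levels of dimension i (excluding the
    virtual top level 'all'): T = product of (L_i + 1)."""
    return prod(l + 1 for l in levels)

# 3 dimensions with no hierarchies (1 level each): 2^3 = 8 cuboids.
print(num_cuboids([1, 1, 1]))   # 8
# e.g., time with 4 levels (day < month < quarter < year),
# item with 2 levels, location with 3 levels:
print(num_cuboids([4, 2, 3]))   # 5 * 3 * 4 = 60
```

The rapid growth of T with both n and the Li is exactly why partial materialization is usually preferred.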
• Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
• Need to compute the following group-bys:
(city, item, year),
(city, item), (city, year), (item, year),
(city), (item), (year),
()
[Figure: Lattice of cuboids — apex cuboid () at the top; (item), (city), (year); then (city, item), (city, year), (item, year); base cuboid (city, item, year) at the bottom.]
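The 2^n group-bys produced by CUBE BY can be enumerated and computed in a few lines (a naive sketch with made-up rows; a real system would share sorts and partial aggregates across cuboids):

```python
from itertools import combinations

def cube_group_bys(dims):
    """Enumerate all 2^n group-bys computed by CUBE BY over dims."""
    for r in range(len(dims), -1, -1):
        for combo in combinations(dims, r):
            yield combo

def compute_cube(rows, dims, measure):
    """Aggregate SUM(measure) for every group-by; dimensions absent
    from a group-by take the placeholder 'all' (SQL:1999 uses NULL)."""
    results = {}
    for combo in cube_group_bys(dims):
        agg = {}
        for row in rows:
            key = tuple(row[d] if d in combo else "all" for d in dims)
            agg[key] = agg.get(key, 0) + row[measure]
        results[combo] = agg
    return results

rows = [
    {"city": "Vancouver", "item": "TV", "year": 2015, "amount": 100},
    {"city": "Vancouver", "item": "PC", "year": 2016, "amount": 50},
    {"city": "Chicago",   "item": "TV", "year": 2016, "amount": 70},
]
cube = compute_cube(rows, ("city", "item", "year"), "amount")
print(len(cube))                                     # 8 group-bys
print(cube[("city",)][("Vancouver", "all", "all")])  # 150
print(cube[()][("all", "all", "all")])               # 220
```

This makes the lattice concrete: each of the 8 group-bys is one cuboid, from the base (city, item, year) down to the apex ().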
• Computing only the cuboid cells whose count or other aggregates satisfy a condition like:
HAVING COUNT(*) >= minsup
Motivation:
– Only a small portion of the cube cells may be "above the water" in a sparse cube
– Only calculate "interesting" cells: data above a certain threshold
– Avoid explosive growth of the cube. Suppose 100 dimensions and only 1 base cell: how many aggregate cells if count >= 1? What about count >= 2?
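The iceberg condition can be sketched by filtering cells after (naive) aggregation; real iceberg algorithms such as BUC prune entire sub-lattices instead of filtering at the end. Data and names here are illustrative:

```python
from itertools import combinations

def iceberg_cube(rows, n_dims, minsup):
    """Keep only cube cells with COUNT(*) >= minsup, i.e., the
    HAVING COUNT(*) >= minsup condition from the text."""
    counts = {}
    for row in rows:
        # Every subset of dimensions yields one aggregate cell, with
        # 'all' standing in for the aggregated-away dimensions.
        for r in range(n_dims + 1):
            for combo in combinations(range(n_dims), r):
                key = tuple(row[i] if i in combo else "all"
                            for i in range(n_dims))
                counts[key] = counts.get(key, 0) + 1
    return {k: c for k, c in counts.items() if c >= minsup}

rows = [("Asia", "TV"), ("Asia", "TV"), ("Asia", "PC"), ("Europe", "TV")]
cube = iceberg_cube(rows, n_dims=2, minsup=2)
print(cube)  # only 4 of the 8 distinct cells survive minsup = 2
```

With minsup = 2 the sparse cells such as ("Europe", "TV") and ("Asia", "PC") are dropped, which is exactly the "above the water" portion of the cube.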
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains
Base table:
Cust | Region  | Type
C1   | Asia    | Retail
C2   | Europe  | Dealer
C3   | Asia    | Dealer
C4   | America | Retail
C5   | Europe  | Dealer

Index on Region:
RecID | Asia | Europe | America
1     | 1    | 0      | 0
2     | 0    | 1      | 0
3     | 1    | 0      | 0
4     | 0    | 0      | 1
5     | 0    | 1      | 0

Index on Type:
RecID | Retail | Dealer
1     | 1      | 0
2     | 0      | 1
3     | 0      | 1
4     | 1      | 0
5     | 0      | 1
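A bitmap index over the base table above can be built and queried with fast bitwise operations (here each bit vector is packed into a Python int; the helper name is illustrative):

```python
base = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
        ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
        ("C5", "Europe", "Dealer")]

def build_bitmap(rows, col):
    """Return {value: bit vector as int}; the i-th bit is set when
    the i-th row has that value in the indexed column."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[col], 0)
        index[row[col]] |= 1 << i   # set the i-th bit
    return index

region = build_bitmap(base, 1)
rtype = build_bitmap(base, 2)

# "Region = Europe AND Type = Dealer" reduces to one bitwise AND:
hits = region["Europe"] & rtype["Dealer"]
matching = [base[i][0] for i in range(len(base)) if hits >> i & 1]
print(matching)  # ['C2', 'C5']
```

One bit vector per distinct value is why the technique suits low-cardinality columns like Region and Type but not high-cardinality domains such as customer IDs.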
• Ab Initio
• DMExpress
• Talend
• Infosphere Datastage
• Informatica PowerCenter
• Oracle OWB and DI
• A centralized data warehouse is a single physical repository which contains integrated data extracted from external information systems.
• Advantages :
– Security
– Ease of Management
– Experience
– Performance
– Expandability
– Reliability
– Vendor Dependency
• Disadvantages
– If the server fails, the entire system collapses
– May not serve large-scale applications
– Costly
[Figure: Centralized data warehouse — a single warehouse at headquarters serving operational processing at Site A and Site B.]
• Many organizations have physically distributed databases with extremely large amounts of data.
• Traditionally the data warehouse would be seen as a centralized repository, whereby data from all sources would be imported into that large centralized repository for analysis.
• Nowadays the speed and bandwidth of wide-area computer networks enables a distributed approach, whereby parts of the data may reside in different places, parts being cached and/or replicated for performance reasons, and the system functions to the outside world as a single global access-transparent repository.
• As the amount of data and number of sites grow, this distributed approach becomes crucial, as a single centralized data warehouse importing data from all the sources has obvious scalability limitations.
• 3 types:
– Local/Global Distributed DW: serving global businesses where there are local operations and a central operation
– Technologically Distributed DW: where the volume of data is such that it is spread over multiple physical volumes
– Independently Evolving Distributed DW: grows up in an uncoordinated manner
Technical Requirements
• Managing large amounts of data
• Managing multiple media
• Indexing & monitoring data
• Efficient loading of data
• Efficient index utilization
• Data compaction
• Lock management
• Fast restore