71
By Mahesh R. Sanghavi Associate professor, SNJB’s KBJ CoE, Chandwad

SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

By Mahesh R. Sanghavi Associate professor,

SNJB’s KBJ CoE, Chandwad

Page 2: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• All the content of these PPTs were taken from PPTS of renown author via internet.

• These PPTs are only mean to share the knowledge among the Students.

• I specially thanks to the authors and owner due to whom it is possible to prepare these PPTs.

Page 3: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Introduction: Data Warehouse Modeling

Data Warehouse Design

Data-ware-house Technology

Distributed Data Warehouse

Materialized view

Page 4: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 4

• A data warehouse is the foremost repository for the data available for developing business intelligence architectures and decision support systems.

• The term data warehousing indicates the whole set of interrelated activities involved in designing, implementing and using a data warehouse

• A data warehouse is a relational database for query and analysis.

It contains historical data

It uses OLAP(online analytical processing) engine

It’s Key features are as follows:

• subject-oriented,

• integrated,

• time-variant,

• nonvolatile

Page 5: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Three main categories of data feeding into a data warehouse

• Internal Data

– Bac-koffice, front office, web-based

• External Data

– Collected from market survey, GIS etc.

• Personal Data

– BI Analyst will produce this data

Page 6: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 6

• OLTP (on-line transaction processing)

– Major task of traditional relational DBMS

– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

• OLAP (on-line analytical processing)

– Major task of data warehouse system

– Data analysis and decision making

• Distinct features (OLTP vs. OLAP):

– User and system orientation: year vs. market

– Data contents: current, detailed vs. historical, consolicityd

– Database design: ER + application vs. star + subject

– View: current, local vs. evolutionary, integrated

– Access patterns: upcity vs. read-only but complex queries

Page 7: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 7

OLTP OLAP

users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date

detailed, flat relational

isolated

historical,

summarized, multidimensional

integrated, consolidated

usage repetitive ad-hoc

access read/write

index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

Page 8: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Integration • Quality • Efficiency • Extendibility • Entity Oriented • Integrated • Time-variant • Persistent • Consolidated • Denormalized

Page 9: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Data warehouse Operational system Subject oriented Transaction oriented

Large (hundreds of GB up to several TB)

Small (MB up to several GB)

Historic data Current data

De-normalized table structure (few tables, many columns per table)

Normalized table structure (many tables, few columns per table)

Batch updates Continuous updates

Usually very complex queries Simple to complex queries

Page 10: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Data marts are systems that gather all the data required by a specific company department, such as marketing or logistics, for the purpose of performing business intelligence analyses and executing decision support applications specific to the function itself • Therefore, a data mart can be considered as a functional or departmental data warehouse of a smaller size and a more specific type than the overall company data warehouse.

Page 11: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Contd..

• Data warehouses and data marts thus share the same technological framework. • In order to implement business intelligence applications, some companies prefer to design and develop in an incremental way a series of integrated data marts rather than a central data warehouse, in order to reduce the implementation time and uncertainties connected with the project.

Page 12: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Data Warehouse Data Mart

Enterprise-wide data Department-wide data

multiple subject areas single subject area

difficult to build easy to build

takes more time to build less time to build

larger memory limited memory

Page 13: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Accuracy • Completeness • Consistency • Timeliness • Non-redundency • Relevance • Interoperability • Accessibility • Accessibility

Page 14: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.

Page 15: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Cross-tabs can be represented as relations

We use the value all is used to represent aggregates

The SQL:1999 standard actually uses null values in place of all despite confusion with regular null values

Page 16: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

A data cube is a multidimensional generalization of a cross-tab

Can have n dimensions; we show 3 below

Cross-tabs can be used as views on a data cube

Page 17: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 17

• Modeling data warehouses: dimensions & measures

– Star schema: A fact table in the middle connected to a set of

dimension tables

– Snowflake schema: A refinement of star schema where

some dimensional hierarchy is normalized into a set of

smaller dimension tables, forming a shape similar to

snowflake

– Fact constellations: Multiple fact tables share dimension

tables, viewed as a collection of stars, therefore called

galaxy schema or fact constellation

Page 18: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 18

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city

state_or_province

country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

item

branch_key

branch_name

branch_type

branch

Page 19: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 19

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_key

item

branch_key

branch_name

branch_type

branch

supplier_key

supplier_type

supplier

city_key

city

state_or_province

country

city

Page 20: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 20

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city

province_or_state

country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

item

branch_key

branch_name

branch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key

shipper_name

location_key

shipper_type

shipper

Page 21: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 21

all

Europe North_America

Mexico Canada Spain Germany

Vancouver

M. Wind L. Chan

...

... ...

... ...

...

all

region

office

country

Toronto Frankfurt city

Page 22: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the
Page 23: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Cross-tabs can be easily extended to deal with hierarchies

Multi levels of hierarchy can be displayed in the cross-tab

Can drill down or roll up on a hierarchy

Cross tab without hierarchy

Figure 18.5

Figure 18.1

Page 24: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 25

Total annual sales

of TV in U.S.A. city

Cou

ntr

y

sum

sum TV

VCR PC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Page 25: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 26

• by climbing up hierarchy or by dimension reduction

• summarize data , perform aggregate on data cube Roll up (drill-up):

• from higher level summary to lower level summary or detailed data, or introducing new dimensions

• reverse of roll-up Drill down (roll down):

• Selection of one dimension from cube, resulting in sub cube Slice :

• Defines sub cube by selecting two or more dimension Dice:

• reorient the cube, visualization, 3D to series of 2D planes

• It is a visualization operation Pivot (rotate):

• involving (across) more than one fact table drill across:

• through the bottom level of the cube to its back-end relational tables (using SQL) drill through:

Page 26: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 27

Page 27: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 28

Top-down view

allows selection of the relevant information

necessary for the data warehouse for

future business needs

Data source view

exposes the information being

captured, stored, and managed by

operational systems

Data warehouse view

consists of fact tables and dimension tables

Business query view

sees the perspectives of data in the

warehouse from the view of end-user

Four views regarding the design of a data

warehouse

Page 28: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Choose a business process to model, e.g., orders, invoices, etc.

Choose the grain of the business process. Grain is fundamental level of data to be represented in the fact table e.g. individual transaction, individual daily snap shot

Choose the dimensions that will apply to each fact table record e.g. Item , year , supplier

Choose the measure that will populate each fact table record like dollars_sold , unit_sold

February 3, 2016 Data Mining: Concepts and Techniques 29

Page 29: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 30

• Top-down: Starts with overall design and planning (mature)

• Bottom-up: Starts with experiments and prototypes (rapid)

Top-down, bottom-up

approaches or a combination

of both

• Waterfall: structured and systematic analysis at each step before proceeding to the next

• Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around

From software engineering

point of view

Page 30: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Two approaches for Centralized Data Warehouse – Inmon’s (Top-Down)

– Kimball’s (Bottom-Up)

In a nutshell

Bill Inmon’s enterprise data warehouse approach (the top-down design): A

normalized data model is designed first. Then the dimensional data marts, which

contain data required for specific business processes or specific departments are

created from the data warehouse.

Ralph Kimball’s dimensional design approach (the bottom-up design): The data marts

facilitating reports and analysis are created first; these are then combined together

to create a broad data warehouse.

Page 32: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Inmon defines the data warehouse in the following terms:

1. Subject-oriented: The data in the data warehouse is organized so that all the

data elements relating to the same real-world event or object are linked

together

2. Time-variant: The changes to the data in the database are tracked and

recorded so that reports can be produced showing changes over time

3. Non-volatile: Data in the data warehouse is never over-written or deleted --

once committed, the data is static, read-only, and retained for future reporting

4. Integrated: The database contains data from most or all of an organization's

operational applications, and that this data is made consistent

Page 34: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Data marts are created first.

• These provide a thin view into the organizational data, and as and when required these can be combined into a larger data warehouse.

• Kimball defines data warehouse as “A copy of transaction data specifically structured for query and analysis

• Kimball’s data warehousing architecture is also known as Data Warehouse Bus (BUS). Dimensional modeling focuses on ease of end user accessibility and provides a high level of performance to the data warehouse.

Page 35: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the
Page 36: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 37

Data

Warehouse

Extract

Clean

Transform

Load

Refresh

Middle tier

Analysis

Query

Reports

Data mining

Monitor

&

Integrator

Metadata

Data Sources Top tier

Output

Data Marts

Operational

DBs

Other

sources

Bottom Tier

OLAP Server

Page 37: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 38

• get data from multiple, heterogeneous, and external sources Data extraction

• detect errors in the data and rectify them when possible Data cleaning

• convert data from legacy or host format to warehouse format

Data transformation

• sort, summarize, consolicity, compute views, check integrity, and build indicies and partitions Load

• propagate the upcitys from the data sources to the warehouse Refresh

Page 38: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 39

Description of the structure of the data warehouse

•schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents

Operational meta-data

•data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

The algorithms used for summarization

The mapping from operational environment to the data warehouse

Data related to system performance

• Includes indices and profiles that improve data access

•Rules for timing and scheduling of refresh, upcity and replication

Business metadata

•business terms and definitions, ownership of data, charging policies

Meta data is the data defining warehouse objects.

It stores:

Page 39: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 40

Bottom Tier – Ware house database

server

Middle Tier – OLAP Server (ROLAP or

MOLAP)

Top Tier – (Front-end-client

Layer) – contains reporting tool, analysis tool

or data mining tool

Page 40: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 41

Enterprise Data Warehouse (EDW)

• collects all of the information about subjects spanning the entire organization.

• It contains detailed data as well s summarized data and can range in size from gigabytes to terabytes

• Contains detailed data and require years to design and build

Data Mart (DM)

• a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart may confine its subjects to year, item and sale.

• It is in the summarized form and requires few weeks for implementation

• Data mart can be independent or dependent

Page 41: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Virtual Data Warehouse (VDW)

• A set of views over operational databases

• For efficient query processing , only some of the possible summary views may be materialized

• Easy to build

• Requires excess capacity on operational database servers

February 3, 2016 Data Mining: Concepts and Techniques 42

Page 42: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 43

Relational OLAP (ROLAP)

• Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware

• Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services

• Greater scalability

Multidimensional OLAP (MOLAP)

• Sparse array-based multidimensional storage engine

• Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

• Flexibility, e.g., low level: relational, high-level: array

Specialized SQL servers (e.g., Redbricks)

• Specialized support for SQL queries over star/snowflake schemas

Page 43: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Define a high-level corporate data model

Data

Mart

Data

Mart

Distributed

Data Marts

Multi-Tier Data

Warehouse

Enterprise

Data

Warehouse

Model refinement Model refinement

Page 44: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

Information processing

• multidimensional analysis of data warehouse data

• supports basic OLAP operations, slice-dice, drilling, pivoting

Analytical processing

• knowledge discovery from hidden patterns

• supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

Data mining

Page 45: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• ETL

–Extraction

–Transformation

–Loading

Page 46: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 47

Page 47: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 48

Page 48: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

To get data out of the source and load it into the data warehouse –

simply a process of copying data from one

database to other

Data is extracted from an OLTP database,

transformed to match the data warehouse schema and loaded

into the data warehouse database

Many data warehouses also incorporate data

from non-OLTP systems such as text files, legacy

systems, and spreadsheets; such data also requires extraction,

transformation, and loading

Page 49: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Data is extracted from heterogeneous data sources

Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data.

Page 50: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

ETL process needs to effectively integrate systems that have different:

• DBMS

• Operating Systems

• Hardware

• Communication protocols

Need to have a logical data map

before the physical data can be transformed

The logical data map describes the relationship between the

extreme starting points and the

extreme ending points of your ETL

system usually presented in a table

or spreadsheet

Page 51: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• The table type gives us our queue for the ordinal position of our data load processes—first dimensions, then facts.

• The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process

• The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement

Target Source Transformation

Table Name Column Name Data Type Table Name Column Name

Data Type

Page 52: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Actually changes data and provides guidance whether data can be used for its intended purposes

Performed in staging area

Page 53: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Correct

Unambiguous

Consistent

Complete

Data quality checks are run at 2 places - after extraction and after cleaning and confirming additional check are run at this point

Page 54: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Column Property Enforcement

• Null Values in required columns

• Numeric values that fall outside of expected high and lows

• Cols whose lengths are exceptionally short/long

• Cols with certain values outside of discrete valid value sets

• Adherence to a required pattern/ member of a set of pattern

• Spell check

• Outlier check

Page 55: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Tables have proper primary and foreign keys

• Obey referential integrity

Structure Enforcement

• Simple business rules

• Logical data checks

Data and Rule value

enforcement

Page 56: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 57

Page 57: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Staged Data

Cleaning And

Confirming Fatal Errors

Stop

Loading

Yes

No

Page 58: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Physically built to have the minimal sets of components

The primary key is a single field containing meaningless unique integer – Surrogate Keys

The DW owns these keys and never allows any other entity to assign them

De-normalized flat tables – all attributes in a dimension must take on a single value in the presence of a dimension primary key.

Should possess one or more other fields that compose the natural key of the dimension

Page 59: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the
Page 60: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.

Creating and assigning the surrogate keys occur in this module.

Surrogate keys are keys that have no “business” meaning and are solely used to identify a record in the table. Such keys are either database generated (example: Identity in SQL Server, Sequence in Oracle, Sequence/Identity in DB2 UDB etc.)

The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.

Page 61: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Efficient Data Cube Computation

• Cube operation

• Iceberg

• Indexing

Page 62: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 63

• Data cube can be viewed as a lattice of cuboids

– The bottom-most cuboid is the base cuboid

– The top-most cuboid (apex) contains only one cell

– How many cuboids in an n-dimensional cube with L levels?

• Materialization of data cube

– Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)

– Selection of which cuboids to materialize • Based on size, sharing, access frequency, etc.

)11(

n

ii

LT

Page 63: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 64

• Cube definition and computation in DMQL

define cube sales[item, city, year]: sum(sales_in_dollars)

compute cube sales

• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)

SELECT item, city, year, SUM (amount)

FROM SALES

CUBE BY item, city, year

• Need compute the following Group-Bys

(city, item, year),

(city, Item),(city, year), (item, year),

(city), (item), (year)

()

(item) (city)

()

(year)

(city, item) (city, year) (item, year)

(city, item, year)

Page 64: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 65

• Computing only the cuboid cells whose count or

other aggregates satisfying the condition like

HAVING COUNT(*) >= minsup

Motivation

Only a small portion of cube cells may be “above the water’’ in a sparse cube

Only calculate “interesting” cells—data above certain threshold

Avoid explosive growth of the cube Suppose 100 dimensions, only 1 base cell. How many aggregate cells

if count >= 1? What about count >= 2?

Page 65: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

February 3, 2016 Data Mining: Concepts and Techniques 66

• Index on a particular column

• Each value in the column has a bit vector: bit-op is fast

• The length of the bit vector: # of records in the base table

• The i-th bit is set if the i-th row of the base table has the value for the indexed column

• not suitable for high cardinality domains

Cust Region Type

C1 Asia Retail

C2 Europe Dealer

C3 Asia Dealer

C4 America Retail

C5 Europe Dealer

RecID Retail Dealer

1 1 0

2 0 1

3 0 1

4 1 0

5 0 1

RecIDAsia Europe America

1 1 0 0

2 0 1 0

3 1 0 0

4 0 0 1

5 0 1 0

Base table Index on Region Index on Type

Page 66: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Ab Initio

• DMExpress

• Talend

• Infosphere Datastage

• Informatica PowerCenter

• Oracle OWB and DI

February 3, 2016 Data Mining: Concepts and Techniques 67

Page 67: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• A centralized Data Warehouse is a single physical repository which contains integrated data extracted from external information system.

• Advantages :

– Security

– Ease of Management

– Experience

– Performance

– Expandability

– Reliability

– Vendor Dependency

• Disadvantages

– Server fails entire system get collapse

– May not serve for the Large Scale Application

– Costly

Page 68: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Head Quarters

Operational Processing

Site A Site B

Page 69: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• Many organizations have physically distributed databases with extremely large amounts of data.

• Traditionally the data warehouse would be seen as a centralized repository, whereby data from all sources would be imported into that large centralized repository for analysis.

• Nowadays the speed and bandwidth of wide-area computer networks enables a distributed approach, whereby parts of the data may reside in different places, parts being cached and/or replicated for performance reasons, and the system functions to the outside world as a single global access-transparent repository.

• As the amount of data and number of sites grow, this distributed approach becomes crucial, as a single centralized data warehouse importing data from all the sources has obvious scalability limitations.

Page 70: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

• 3 Types

• Local / Global Distributed DW – Serving global businesses where there are local

operations and a central opeations

• Technologically Distributed DW – Where the volume of data is such that it is spread

over multiple physical volumes

• Independently Evolving Distributed DW – It grows-up in uncoordinated manner

Page 71: SNJB’s KBJ CoE, Chandwad...SNJB’s KBJ CoE, Chandwad •All the content of these PPTs were taken from PPTS of renown author via internet. •These PPTs are only mean to share the

Technical Requirement • Managing large amount of data • Managing multiple media • Indexing & Monitoring Data • Efficient loading of Data • Efficient Index Utilization • Data Compaction • Lock Management • Fast Restore