By Mahesh R. Sanghavi, Associate Professor,
SNJB's KBJ CoE, Chandwad
• All the content of these PPTs was taken from the PPTs of renowned authors via the internet.
• These PPTs are only meant to share knowledge among the students.
• I specially thank the authors and owners due to whom it was possible to prepare these PPTs.
Introduction: Data Warehouse Modeling
Data Warehouse Design
Data Warehouse Technology
Distributed Data Warehouse
Materialized view
February 3, 2016 Data Mining: Concepts and Techniques 4
• A data warehouse is the foremost repository for the data available for developing business intelligence architectures and decision support systems.
• The term data warehousing indicates the whole set of interrelated activities involved in designing, implementing, and using a data warehouse.
• A data warehouse is a relational database for query and analysis.
It contains historical data
It uses an OLAP (online analytical processing) engine
Its key features are as follows:
• subject-oriented,
• integrated,
• time-variant,
• nonvolatile
• Three main categories of data feeding into a data warehouse
• Internal Data
– Back-office, front-office, web-based
• External Data
– Collected from market surveys, GIS, etc.
• Personal Data
– Produced by the BI analyst
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
                   | OLTP                                       | OLAP
users              | clerk, IT professional                     | knowledge worker
function           | day-to-day operations                      | decision support
DB design          | application-oriented                       | subject-oriented
data               | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage              | repetitive                                 | ad-hoc
access             | read/write; index/hash on primary key      | lots of scans
unit of work       | short, simple transaction                  | complex query
# records accessed | tens                                       | millions
# users            | thousands                                  | hundreds
DB size            | 100MB-GB                                   | 100GB-TB
metric             | transaction throughput                     | query throughput, response time
• Integration
• Quality
• Efficiency
• Extendibility
• Entity-oriented
• Integrated
• Time-variant
• Persistent
• Consolidated
• Denormalized
Data warehouse                                | Operational system
Subject-oriented                              | Transaction-oriented
Large (hundreds of GB up to several TB)       | Small (MB up to several GB)
Historic data                                 | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates                                 | Continuous updates
Usually very complex queries                  | Simple to complex queries
• Data marts are systems that gather all the data required by a specific company department, such as marketing or logistics, for the purpose of performing business intelligence analyses and executing decision support applications specific to the function itself.
• Therefore, a data mart can be considered a functional or departmental data warehouse of a smaller size and a more specific type than the overall company data warehouse.
Contd..
• Data warehouses and data marts thus share the same technological framework.
• In order to implement business intelligence applications, some companies prefer to design and develop, in an incremental way, a series of integrated data marts rather than a central data warehouse, in order to reduce the implementation time and uncertainties connected with the project.
Data Warehouse           | Data Mart
Enterprise-wide data     | Department-wide data
multiple subject areas   | single subject area
difficult to build       | easy to build
takes more time to build | takes less time to build
larger memory            | limited memory
• Accuracy • Completeness • Consistency • Timeliness • Non-redundancy • Relevance • Interoperability • Accessibility
• The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
Cross-tabs can be represented as relations
The value all is used to represent aggregates
The SQL:1999 standard actually uses null values in place of all, despite the potential confusion with regular null values
A data cube is a multidimensional generalization of a cross-tab
Can have n dimensions; we show 3 below
Cross-tabs can be used as views on a data cube
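As a small sketch (in Python, with made-up sales data; the relation and helper name are illustrative, not from the slides), a cross-tab represented as a relation can be computed by having every detail row also contribute to the aggregate cells that use the special value all:

```python
from itertools import product

# Hypothetical sales relation: (item, color, quantity)
sales = [
    ("skirt", "dark", 8), ("skirt", "pastel", 35),
    ("dress", "dark", 20), ("dress", "pastel", 10),
]

def cross_tab(rows):
    """Aggregate quantity for every (item, color) pair, using the
    special value 'all' to represent aggregates over a dimension."""
    totals = {}
    for item, color, qty in rows:
        # Each row contributes to four cells: the detail cell plus
        # the aggregate cells that replace a key with 'all'.
        for i, c in product((item, "all"), (color, "all")):
            totals[(i, c)] = totals.get((i, c), 0) + qty
    return totals

tab = cross_tab(sales)
print(tab[("skirt", "all")])   # row total for skirt: 43
print(tab[("all", "all")])     # grand total: 73
```

Replacing "all" with None in the keys would mimic the SQL:1999 convention of using null in place of all.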
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables; viewed as a collection of stars, therefore called galaxy schema or fact constellation
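A minimal star-schema sketch in SQLite (table and column names follow the sales example in these slides; the data values are made up) shows the fact table joining to each dimension through its surrogate key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item,
    location_key INTEGER REFERENCES location,
    units_sold INTEGER, dollars_sold REAL);
""")
con.executemany("INSERT INTO time_dim VALUES (?,?,?)",
                [(1, 2016, "1Qtr"), (2, 2016, "2Qtr")])
con.executemany("INSERT INTO item VALUES (?,?,?)",
                [(1, "TV", "BrandA"), (2, "PC", "BrandB")])
con.executemany("INSERT INTO location VALUES (?,?,?)",
                [(1, "Vancouver", "Canada"), (2, "Chicago", "USA")])
con.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                [(1, 1, 1, 10, 4000.0), (1, 2, 2, 5, 6000.0),
                 (2, 1, 1, 7, 2800.0)])

# A typical subject-oriented query: one join per dimension needed.
rows = con.execute("""
    SELECT i.item_name, SUM(f.units_sold)
    FROM sales_fact f JOIN item i ON f.item_key = i.item_key
    GROUP BY i.item_name ORDER BY i.item_name
""").fetchall()
print(rows)  # [('PC', 5), ('TV', 17)]
```

In a snowflake variant, item would instead carry a supplier_key referencing a separate supplier table, adding one more join to the same query.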
[Figure: Star schema example — Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales) connected to dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city, state_or_province, country).]
[Figure: Snowflake schema example — the same Sales Fact Table, but with normalized dimensions: item (item_key, item_name, brand, type, supplier_key) references supplier (supplier_key, supplier_type), and location (location_key, street, city_key) references city (city_key, city, state_or_province, country); time and branch as in the star schema.]
[Figure: Fact constellation example — the Sales Fact Table and a Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped) share the time, item, and location dimension tables; the shipper dimension (shipper_key, shipper_name, location_key, shipper_type) belongs to the shipping fact table.]
[Figure: A concept hierarchy for the dimension location — all > region (Europe, North_America) > country (Germany, Spain, Canada, Mexico) > city (Frankfurt, Vancouver, Toronto, ...) > office (L. Chan, M. Wind, ...).]
Cross-tabs can be easily extended to deal with hierarchies
Multiple levels of a hierarchy can be displayed in the cross-tab
Can drill down or roll up on a hierarchy
Cross-tab without hierarchy
Figure 18.5
Figure 18.1
[Figure: A 3-D data cube with dimensions item (TV, VCR, PC), time (1Qtr-4Qtr), and location (U.S.A., Canada, Mexico); sum cells aggregate along each dimension, e.g., the cell for total annual sales of TV in U.S.A.]
Roll-up (drill-up):
• summarize data; perform aggregation on the data cube
• by climbing up a hierarchy or by dimension reduction
Drill-down (roll-down):
• reverse of roll-up
• from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice:
• selection on one dimension of the cube, resulting in a sub-cube
Dice:
• defines a sub-cube by selecting on two or more dimensions
Pivot (rotate):
• a visualization operation: reorient the cube; 3-D to a series of 2-D planes
Drill-across:
• involves (across) more than one fact table
Drill-through:
• through the bottom level of the cube to its back-end relational tables (using SQL)
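The first few operations can be sketched on a toy in-memory cube (a dict keyed by dimension values; data and helper names are made up for illustration):

```python
# A toy cube keyed by (city, item, quarter) -> units_sold.
cube = {
    ("Vancouver", "TV", "1Qtr"): 10, ("Vancouver", "TV", "2Qtr"): 7,
    ("Vancouver", "PC", "1Qtr"): 4,  ("Chicago", "TV", "1Qtr"): 6,
    ("Chicago", "PC", "2Qtr"): 9,
}

def roll_up(cube, drop_axis):
    """Roll-up by dimension reduction: aggregate one axis away."""
    out = {}
    for key, v in cube.items():
        k = tuple(x for i, x in enumerate(key) if i != drop_axis)
        out[k] = out.get(k, 0) + v
    return out

def slice_(cube, axis, value):
    """Slice: fix one dimension to a single value -> a sub-cube."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, selections):
    """Dice: restrict two or more dimensions; selections maps an
    axis index to its allowed set of values."""
    return {k: v for k, v in cube.items()
            if all(k[a] in vals for a, vals in selections.items())}

# Roll up over quarter: totals per (city, item).
print(roll_up(cube, 2)[("Vancouver", "TV")])        # 17
# Slice on city = "Chicago".
print(slice_(cube, 0, "Chicago"))
# Dice on item in {TV} and quarter in {1Qtr}.
print(dice(cube, {1: {"TV"}, 2: {"1Qtr"}}))
```

Drill-down is simply the inverse mapping (back to the more detailed cube), and pivot only changes how the same cells are displayed, so neither changes the stored data.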
Four views regarding the design of a data warehouse:
Top-down view
– allows selection of the relevant information necessary for the data warehouse for future business needs
Data source view
– exposes the information being captured, stored, and managed by operational systems
Data warehouse view
– consists of fact tables and dimension tables
Business query view
– sees the perspectives of data in the warehouse from the view of the end user
1. Choose a business process to model, e.g., orders, invoices, etc.
2. Choose the grain of the business process. The grain is the fundamental level of data to be represented in the fact table, e.g., an individual transaction or an individual daily snapshot.
3. Choose the dimensions that will apply to each fact table record, e.g., item, year, supplier.
4. Choose the measures that will populate each fact table record, e.g., dollars_sold, units_sold.
• Top-down: Starts with overall design and planning (mature)
• Bottom-up: Starts with experiments and prototypes (rapid)
Top-down, bottom-up approaches, or a combination of both
• Waterfall: structured and systematic analysis at each step before proceeding to the next
• Spiral: rapid generation of increasingly functional systems; short turnaround time
From a software engineering point of view
• Two approaches for Centralized Data Warehouse – Inmon’s (Top-Down)
– Kimball’s (Bottom-Up)
In a nutshell
Bill Inmon's enterprise data warehouse approach (the top-down design): A normalized data model is designed first. Then the dimensional data marts, which contain data required for specific business processes or specific departments, are created from the data warehouse.
Ralph Kimball's dimensional design approach (the bottom-up design): The data marts facilitating reports and analysis are created first; these are then combined together to create a broad data warehouse.
Inmon defines the data warehouse in the following terms:
1. Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
2. Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.
3. Non-volatile: Data in the data warehouse is never over-written or deleted; once committed, the data is static, read-only, and retained for future reporting.
4. Integrated: The database contains data from most or all of an organization's operational applications, and this data is made consistent.
• Data marts are created first.
• These provide a thin view into the organizational data, and as and when required these can be combined into a larger data warehouse.
• Kimball defines the data warehouse as "a copy of transaction data specifically structured for query and analysis."
• Kimball’s data warehousing architecture is also known as Data Warehouse Bus (BUS). Dimensional modeling focuses on ease of end user accessibility and provides a high level of performance to the data warehouse.
[Figure: Three-tier data warehouse architecture — Bottom tier: the data warehouse database server, fed from operational DBs and other sources via extract, clean, transform, load, and refresh, coordinated by a monitor & integrator with metadata; data marts sit alongside the warehouse. Middle tier: OLAP server. Top tier (front end): analysis, query, reports, and data mining tools producing the output.]
Data extraction
• get data from multiple, heterogeneous, and external sources
Data cleaning
• detect errors in the data and rectify them when possible
Data transformation
• convert data from legacy or host format to warehouse format
Load
• sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
• propagate the updates from the data sources to the warehouse
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
• schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
Operational metadata
• data lineage (history of migrated data and transformation paths), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from the operational environment to the data warehouse
Data related to system performance
• includes indices and profiles that improve data access
• rules for the timing and scheduling of refresh, update, and replication
Business metadata
• business terms and definitions, ownership of data, charging policies
Bottom tier – warehouse database server
Middle tier – OLAP server (ROLAP or MOLAP)
Top tier – front-end client layer; contains reporting tools, analysis tools, or data mining tools
Enterprise Data Warehouse (EDW)
• collects all of the information about subjects spanning the entire organization.
• It contains detailed data as well as summarized data and can range in size from gigabytes to terabytes
• Contains detailed data and requires years to design and build
Data Mart (DM)
• a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects; for example, a marketing data mart may confine its subjects to year, item, and sales.
• It is in summarized form and requires a few weeks for implementation
• Data mart can be independent or dependent
Virtual Data Warehouse (VDW)
• A set of views over operational databases
• For efficient query processing , only some of the possible summary views may be materialized
• Easy to build
• Requires excess capacity on operational database servers
Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
• Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
• Greater scalability
Multidimensional OLAP (MOLAP)
• Sparse array-based multidimensional storage engine
• Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
• Flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers (e.g., Redbricks)
• Specialized support for SQL queries over star/snowflake schemas
[Figure: A recommended development approach — define a high-level corporate data model; through successive model refinement, build data marts and distributed data marts, leading to a multi-tier data warehouse / enterprise data warehouse.]
Information processing
• supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts, and graphs
Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations: slice-and-dice, drilling, pivoting
Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
• ETL
–Extraction
–Transformation
–Loading
To get data out of the source and load it into the data warehouse – simply a process of copying data from one database to another.
Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.
Data is extracted from heterogeneous data sources
Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data.
ETL process needs to effectively integrate systems that have different:
• DBMS
• Operating Systems
• Hardware
• Communication protocols
Need to have a logical data map before the physical data can be transformed.
The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, usually presented in a table or spreadsheet.
• The table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts.
• The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process
• The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement
[Logical data map columns: Target (Table Name, Column Name, Data Type) | Source (Table Name, Column Name, Data Type) | Transformation]
Actually changes data and provides guidance on whether data can be used for its intended purposes
Performed in staging area
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places: after extraction, and after cleaning and conforming; additional checks are run at this point
Column Property Enforcement
• Null values in required columns
• Numeric values that fall outside of expected highs and lows
• Columns whose lengths are exceptionally short/long
• Columns with certain values outside of discrete valid value sets
• Adherence to a required pattern / membership in a set of patterns
• Spell check
• Outlier check
Structure Enforcement
• Tables have proper primary and foreign keys
• Obey referential integrity
Data and Rule Value Enforcement
• Simple business rules
• Logical data checks
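A few of the column-property checks above can be sketched as follows (the staged rows, column names, valid-value set, and pattern are all hypothetical):

```python
import re

rows = [
    {"cust_id": "C001", "region": "Asia",   "age": 34},
    {"cust_id": None,   "region": "Europe", "age": 29},   # null key
    {"cust_id": "C003", "region": "Mars",   "age": 210},  # bad set + range
]

VALID_REGIONS = {"Asia", "Europe", "America"}   # discrete valid value set
ID_PATTERN = re.compile(r"^C\d{3}$")            # required pattern

def check_row(row):
    """Return a list of quality-rule violations for one staged row."""
    errors = []
    if row["cust_id"] is None:
        errors.append("null value in required column cust_id")
    elif not ID_PATTERN.match(row["cust_id"]):
        errors.append("cust_id does not match required pattern")
    if row["region"] not in VALID_REGIONS:
        errors.append("region outside discrete valid value set")
    if not (0 <= row["age"] <= 120):
        errors.append("age outside expected high/low range")
    return errors

bad = {i: check_row(r) for i, r in enumerate(rows) if check_row(r)}
print(bad)  # rows 1 and 2 fail
```

Structure enforcement (primary/foreign keys, referential integrity) would be checked across tables rather than per row, typically by anti-joins against the dimension keys.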
[Figure: Cleaning-and-conforming flow — staged data passes through cleaning and conforming; if fatal errors are found, stop; otherwise proceed to loading.]
Physically built to have the minimal sets of components
The primary key is a single field containing meaningless unique integer – Surrogate Keys
The DW owns these keys and never allows any other entity to assign them
De-normalized flat tables – all attributes in a dimension must take on a single value in the presence of a dimension primary key.
Should possess one or more other fields that compose the natural key of the dimension
The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occur in this module.
Surrogate keys are keys that have no "business" meaning and are solely used to identify a record in the table. Such keys are typically database-generated (for example, Identity in SQL Server, Sequence in Oracle, Sequence/Identity in DB2 UDB) or assigned by the ETL process.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.
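A minimal sketch of surrogate-key assignment during a dimension load (the class and field names are illustrative; this shows the simplest SCD Type 1 overwrite, not the full range of slowly changing dimension techniques):

```python
class DimensionLoader:
    """The warehouse, not the source system, owns the surrogate keys;
    the natural key is kept as a separate lookup field."""

    def __init__(self):
        self.next_key = 1          # warehouse-assigned integer keys
        self.by_natural_key = {}   # natural key -> surrogate key
        self.rows = {}             # surrogate key -> attributes

    def load(self, natural_key, attributes):
        """Insert or overwrite a dimension row (SCD Type 1)."""
        sk = self.by_natural_key.get(natural_key)
        if sk is None:
            sk = self.next_key     # meaningless unique integer
            self.next_key += 1
            self.by_natural_key[natural_key] = sk
        self.rows[sk] = dict(attributes, natural_key=natural_key)
        return sk

dim = DimensionLoader()
k1 = dim.load("SKU-17", {"item_name": "TV", "brand": "BrandA"})
k2 = dim.load("SKU-99", {"item_name": "PC", "brand": "BrandB"})
k3 = dim.load("SKU-17", {"item_name": "TV", "brand": "BrandC"})  # update
print(k1, k2, k3)  # 1 2 1: the same natural key keeps its surrogate key
```

An SCD Type 2 loader would instead assign a new surrogate key on change and keep the old row with an end date, so that facts keep pointing at the historically correct dimension version.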
• Efficient Data Cube Computation
• Cube operation
• Iceberg
• Indexing
• Data cube can be viewed as a lattice of cuboids
– The bottom-most cuboid is the base cuboid
– The top-most cuboid (apex) contains only one cell
– How many cuboids in an n-dimensional cube with L levels?
• Materialization of data cube
– Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
– Selection of which cuboids to materialize, based on size, sharing, access frequency, etc.
T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1), i.e., the product over i = 1..n of (Li + 1), where Li is the number of hierarchy levels of dimension i.
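The cuboid count is easy to evaluate directly (a small helper, using the formula above; the example level counts are illustrative):

```python
from math import prod

def num_cuboids(levels):
    """Number of cuboids in an n-dimensional cube, where levels[i] is
    the number of hierarchy levels of dimension i (excluding the
    virtual top level 'all'): T = product of (L_i + 1)."""
    return prod(l + 1 for l in levels)

# 3 dimensions with no hierarchies (1 level each): 2^3 = 8 cuboids.
print(num_cuboids([1, 1, 1]))   # 8
# e.g., time with 4 levels (day < month < quarter < year),
# item with 2 levels, location with 3 levels:
print(num_cuboids([4, 2, 3]))   # 5 * 3 * 4 = 60
```

The rapid growth of T with both n and the Li is exactly why partial materialization is usually preferred.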
• Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
• Need to compute the following group-bys:
(city, item, year),
(city, item), (city, year), (item, year),
(city), (item), (year),
()
[Figure: Lattice of cuboids — apex cuboid () at the top; (item), (city), (year); then (city, item), (city, year), (item, year); base cuboid (city, item, year) at the bottom.]
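The 2^n group-bys produced by CUBE BY can be enumerated and computed in a few lines (a naive sketch with made-up rows; a real system would share sorts and partial aggregates across cuboids):

```python
from itertools import combinations

def cube_group_bys(dims):
    """Enumerate all 2^n group-bys computed by CUBE BY over dims."""
    for r in range(len(dims), -1, -1):
        for combo in combinations(dims, r):
            yield combo

def compute_cube(rows, dims, measure):
    """Aggregate SUM(measure) for every group-by; dimensions absent
    from a group-by take the placeholder 'all' (SQL:1999 uses NULL)."""
    results = {}
    for combo in cube_group_bys(dims):
        agg = {}
        for row in rows:
            key = tuple(row[d] if d in combo else "all" for d in dims)
            agg[key] = agg.get(key, 0) + row[measure]
        results[combo] = agg
    return results

rows = [
    {"city": "Vancouver", "item": "TV", "year": 2015, "amount": 100},
    {"city": "Vancouver", "item": "PC", "year": 2016, "amount": 50},
    {"city": "Chicago",   "item": "TV", "year": 2016, "amount": 70},
]
cube = compute_cube(rows, ("city", "item", "year"), "amount")
print(len(cube))                                     # 8 group-bys
print(cube[("city",)][("Vancouver", "all", "all")])  # 150
print(cube[()][("all", "all", "all")])               # 220
```

This makes the lattice concrete: each of the 8 group-bys is one cuboid, from the base (city, item, year) down to the apex ().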
• Computing only the cuboid cells whose count or other aggregates satisfy a condition like:
HAVING COUNT(*) >= minsup
Motivation:
– Only a small portion of the cube cells may be "above the water" in a sparse cube
– Only calculate "interesting" cells: data above a certain threshold
– Avoid explosive growth of the cube. Suppose 100 dimensions and only 1 base cell: how many aggregate cells if count >= 1? What about count >= 2?
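The iceberg condition can be sketched by filtering cells after (naive) aggregation; real iceberg algorithms such as BUC prune entire sub-lattices instead of filtering at the end. Data and names here are illustrative:

```python
from itertools import combinations

def iceberg_cube(rows, n_dims, minsup):
    """Keep only cube cells with COUNT(*) >= minsup, i.e., the
    HAVING COUNT(*) >= minsup condition from the text."""
    counts = {}
    for row in rows:
        # Every subset of dimensions yields one aggregate cell, with
        # 'all' standing in for the aggregated-away dimensions.
        for r in range(n_dims + 1):
            for combo in combinations(range(n_dims), r):
                key = tuple(row[i] if i in combo else "all"
                            for i in range(n_dims))
                counts[key] = counts.get(key, 0) + 1
    return {k: c for k, c in counts.items() if c >= minsup}

rows = [("Asia", "TV"), ("Asia", "TV"), ("Asia", "PC"), ("Europe", "TV")]
cube = iceberg_cube(rows, n_dims=2, minsup=2)
print(cube)  # only 4 of the 8 distinct cells survive minsup = 2
```

With minsup = 2 the sparse cells such as ("Europe", "TV") and ("Asia", "PC") are dropped, which is exactly the "above the water" portion of the cube.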
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains
Base table:
Cust | Region  | Type
C1   | Asia    | Retail
C2   | Europe  | Dealer
C3   | Asia    | Dealer
C4   | America | Retail
C5   | Europe  | Dealer

Index on Region:
RecID | Asia | Europe | America
1     | 1    | 0      | 0
2     | 0    | 1      | 0
3     | 1    | 0      | 0
4     | 0    | 0      | 1
5     | 0    | 1      | 0

Index on Type:
RecID | Retail | Dealer
1     | 1      | 0
2     | 0      | 1
3     | 0      | 1
4     | 1      | 0
5     | 0      | 1
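A bitmap index over the base table above can be built and queried with fast bitwise operations (here each bit vector is packed into a Python int; the helper name is illustrative):

```python
base = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
        ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
        ("C5", "Europe", "Dealer")]

def build_bitmap(rows, col):
    """Return {value: bit vector as int}; the i-th bit is set when
    the i-th row has that value in the indexed column."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[col], 0)
        index[row[col]] |= 1 << i   # set the i-th bit
    return index

region = build_bitmap(base, 1)
rtype = build_bitmap(base, 2)

# "Region = Europe AND Type = Dealer" reduces to one bitwise AND:
hits = region["Europe"] & rtype["Dealer"]
matching = [base[i][0] for i in range(len(base)) if hits >> i & 1]
print(matching)  # ['C2', 'C5']
```

One bit vector per distinct value is why the technique suits low-cardinality columns like Region and Type but not high-cardinality domains such as customer IDs.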
• Ab Initio
• DMExpress
• Talend
• Infosphere Datastage
• Informatica PowerCenter
• Oracle OWB and DI
• A centralized data warehouse is a single physical repository which contains integrated data extracted from external information systems.
• Advantages :
– Security
– Ease of Management
– Experience
– Performance
– Expandability
– Reliability
– Vendor Dependency
• Disadvantages
– If the server fails, the entire system collapses
– May not serve large-scale applications
– Costly
[Figure: Centralized data warehouse — a single warehouse at headquarters serving operational processing at Site A and Site B.]
• Many organizations have physically distributed databases with extremely large amounts of data.
• Traditionally the data warehouse would be seen as a centralized repository, whereby data from all sources would be imported into that large centralized repository for analysis.
• Nowadays the speed and bandwidth of wide-area computer networks enables a distributed approach, whereby parts of the data may reside in different places, parts being cached and/or replicated for performance reasons, and the system functions to the outside world as a single global access-transparent repository.
• As the amount of data and number of sites grow, this distributed approach becomes crucial, as a single centralized data warehouse importing data from all the sources has obvious scalability limitations.
• 3 types:
– Local/Global Distributed DW: serving global businesses where there are local operations and a central operation
– Technologically Distributed DW: where the volume of data is such that it is spread over multiple physical volumes
– Independently Evolving Distributed DW: grows up in an uncoordinated manner
Technical Requirements
• Managing large amounts of data
• Managing multiple media
• Indexing & monitoring data
• Efficient loading of data
• Efficient index utilization
• Data compaction
• Lock management
• Fast restore