Upload
carmel-barrett
View
275
Download
0
Embed Size (px)
DESCRIPTION
2016年1月23日星期六 2016年1月23日星期六 2016年1月23日星期六 Introduction to D/W 3 Evolution of Database Technology 1960s: Data collection, database creation, network DBMS 1970s: Relational data model, relational DBMS 1980s: Advanced data models (extended-relational, OO, spatial, temporal D/Bs, etc.) 1990s-current: Data mining, data warehousing, multimedia D/B, and Web
Citation preview
二〇二三年五月三日 Introduction to D/W 1
Introduction to Data Warehouse
二〇二三年五月三日 Introduction to D/W 2
Chapter Outline
What is a data warehouse? A show case of data warehousing A multi-dimensional data model Data warehouse architecture Data warehouse tools and utilities
二〇二三年五月三日 Introduction to D/W 3
Evolution of Database Technology
1960s: Data collection, database creation, network DBMS
1970s: Relational data model, relational DBMS
1980s: Advanced data models (extended-relational, OO, spatial,
temporal D/Bs, etc.) 1990s-current:
Data mining, data warehousing, multimedia D/B, and Web
二〇二三年五月三日 Introduction to D/W 4
Reporting without D/W
Data Sources
Reports/ Screens
二〇二三年五月三日 Introduction to D/W 5
Reporting with D/W
Data Sources
Reports/ Screens
Data Warehouse
二〇二三年五月三日 Introduction to D/W 6
Work Involved in Producing a Report
Data Sources
Report/ Screen
Locate the data needed for the report
Gather/select the data Convert the data Merge data from different sources Build the report
二〇二三年五月三日 Introduction to D/W 7
Reporting Work Involved with a D/W
Data Sources
Report/ ScreenData Warehouse
Locate the data needed for the report
Gather the data Convert the data Merge data from different
sources
Build the report
二〇二三年五月三日 Introduction to D/W 8
Work Involved in Producing N Reports
Data Sources
locate gather convert merge build
N Reports/ Screens
locate gather convert merge build
locate gather convert merge build
locate gather convert merge build
locate gather convert merge build
locate gather convert merge build
二〇二三年五月三日 Introduction to D/W 9
Data Sources
Data Warehouse
Locate the data needed for the report
Gather the data Convert the data Merge data from different
sources
N Reports/ Screens
Build the reportBuild the
reportBuild the reportBuild the
reportBuild the reportBuild the
report
Reporting Work Involved with a D/W
二〇二三年五月三日 Introduction to D/W 10
Data Stores, Warehouses and Marts
An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.
A data warehouse is a collection of integrated databases designed to support DSSs.
A data mart is a lower-cost, scaled-down version of a data warehouse, usually designed to support a small group of users (a department rather than an entire firm).
二〇二三年五月三日 Introduction to D/W 11
The Data Warehouse Environment
The organization’s legacy systems and data stores provide data to the data warehouse, data mart or operational data store.
During the transfer of data from the various sources data cleaning or transformation may
occur, so the data in the DW is more uniform.
simultaneously, metadata is recorded. Finally, the DW or mart may be used to
create one or more “personal” warehouses.
二〇二三年五月三日 Introduction to D/W 12
Organizational Data Flow
二〇二三年五月三日 Introduction to D/W 13
A show case of data warehousing
一家連鎖食品公司的簡易資料倉儲使用案例 公司銷售資料特性簡介 各種營運管理 ( 角度 ) 的資料分析報表
二〇二三年五月三日 Introduction to D/W 14
資料特性介紹 - Foodmart北美連鎖食品公司營運資料 共有 24 家連鎖店 (location dimension) 商品類型 (product dimension, concept hierarchy)
47 種產品大分類 107 種子分類
864,558 筆 POS 資料 (1998 年 ) (time dimension) 32,281 筆客戶資料 (customer dimension) 月均營業額達 US$ 1,279,147(1998 年 )Data Warehouse uses a multi-dimensional model. It helps find trends among dimensions.
二〇二三年五月三日 Introduction to D/W 15
Example of Data Analysis Presentation(Categorized Graphs)
分類的、可相比較的差異分析圖表 將多維複雜的資料簡化為人類容易辨讀的
二維空間 ( 最多可呈現四維資料 ) 趨勢或 三維空間 ( 最多可呈現五維資料 ) 趨勢
二〇二三年五月三日 Introduction to D/W 16
Categorized Graphs
二〇二三年五月三日 Introduction to D/W 17
Various Examples of Data Analysis
各分店營運比較 (Location dimension) 客戶差異分析 (Customer dimension) 產品差異分析 (Product dimension) 分店 - 產品 - 客戶群分析 (3D analysis)
二〇二三年五月三日 Introduction to D/W 18
分店營運與產品銷售差異分析(Analysis on Location and Product Dimensions)
找出成功與失敗的分店與產品營運經驗 各分店總營業額差異分析 (1D1M 的分析 ) 各分店各產品銷售額差異分析 (2D1M 的分析 ) 各分店性質相關性分析 ( 序列資料分析 ) 各產品營業額、成本、銷售量差異分析 (1D3M)
二〇二三年五月三日 Introduction to D/W 19
分店月營業額差異分析 (1D1M 的分析 )( 銷售額前、後 12 名,分析銷售額 這 1 measure)
較成功或失敗的店在哪裡 ?原因是 ?
二〇二三年五月三日 Introduction to D/W 20
產品成本、營業額與銷售量差異分析 (1D3M 的分析 ) (Top12 分析 : 成本、營業額、銷售量 3 measures)
成功的產品是什麼 ?成功的原因是 ?利潤分析 ?
二〇二三年五月三日 Introduction to D/W 21
各分店各產品銷售額差異分析 (2D1M 的分析 )
二〇二三年五月三日 Introduction to D/W 22
分店性質相關性分析( 依據各分店各產品的銷售額分析 )連鎖店的經營管理哪些分店的經營特性較相近 ?深入探究原因,掌握共同趨勢,複製成功經驗與避免失敗因素
兩大群中差異最大的分店是 ?
這一群分店是比較成功還是失敗的 ?
二〇二三年五月三日 Introduction to D/W 23
客戶分析例子 (Customer Dimension)(主力消費族群分析) 了解客戶 - 了解什麼樣的客戶買什麼樣的產品
了解讓特定分店與產品成功 / 失敗的客戶是什麼族群 再結合背景人口統計資料,深入探究原因,掌握知識
幾種常見的客戶差異分析報表 客戶年收入差異分析 客戶家中小孩數差異分析 客戶性別差異分析 客戶年齡差異分析
二〇二三年五月三日 Introduction to D/W 24
主力消費族群分析 – 年收入與背景的人口統計資料比較差異,進行顧客差異與定位分析
二〇二三年五月三日 Introduction to D/W 25
各產品主力消費族群分析 ( 年收入的差異 )
二〇二三年五月三日 Introduction to D/W 26
主力消費族群分析 - 家中小孩數為什麼頂客族居多 ?是好,還是壞 ?
二〇二三年五月三日 Introduction to D/W 27
各產品主力消費族群分析( 家中小孩數的差異 )
二〇二三年五月三日 Introduction to D/W 28
主力消費族群分析 – 性別
二〇二三年五月三日 Introduction to D/W 29
主力消費族群分析 – 年齡
二〇二三年五月三日 Introduction to D/W 30
分店性質相關性分析( 依據各分店消費族群各特性 )
Store 21
Store 15
Store 10
Store 19 可根據年收入、家中小孩數、性別、或年齡來分析
兩大群中差異最小、最大的分店是 ?
哪一群分店是比較成功還是失敗的 ?
二〇二三年五月三日 Introduction to D/W 31
What is Data Warehouse?
Defined in many different ways, but not rigorously. A decision support database that is maintained separately
from the organization’s operational database Support information processing by providing a solid platform
of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated,
time-variant (time series), and nonvolatile (read-only) collection of data in support of management’s decision-making process.”—W. H. Inmon
Data warehousing: The process of constructing and using data warehouses
二〇二三年五月三日 Introduction to D/W 32
Data Warehouse Is Subject-Oriented
二〇二三年五月三日 Introduction to D/W 33
Data Warehouse—Subject-Oriented
Data are organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing for clerks.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
二〇二三年五月三日 Introduction to D/W 34
Data Warehouse Is Integrated
Different data coding, data type, attribute name, programming language, DBMS
二〇二三年五月三日 Introduction to D/W 35
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line
transaction records Data cleaning, data transformation and data
integration techniques may be applied. Ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
二〇二三年五月三日 Introduction to D/W 36
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly But the key of operational data may or may not
contain “time element”.
二〇二三年五月三日 Introduction to D/W 37
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: Initial loading of historical data Retrieval of data.
二〇二三年五月三日 Introduction to D/W 38
Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing) Major task of traditional relational DBMS Day-to-day operations: purchasing,
inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing) Major task of data warehouse system Data analysis and decision making
二〇二三年五月三日 Introduction to D/W 39
OLTP vs. OLAP
OLTP OLAP users clerk knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, detailed,
relational, isolated historical, summarized, multidimensional, integrated, consolidated
usage repetitive ad-hoc access read/write
index/hash on prim. key lots of scans
unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB ~ GB 100GB ~ TB metric transaction throughput query throughput, response
二〇二三年五月三日 Introduction to D/W 40
Why Separate Data Warehouse? High performance for both systems
DBMS — tuned for OLTP: access methods, indexing, concurrency control, recovery, transaction processing
Warehouse — tuned for OLAP: complex OLAP queries, multidimensional graphs.
Different functions and different data: Missing data: Decision support requires historical data which
operational DBs do not typically maintain Data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
二〇二三年五月三日 Introduction to D/W 41
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions by various users
A data cube consists of a set of dimension and fact tables. Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
Cube
Schema
二〇二三年五月三日 Introduction to D/W 42
all
product date country
product,date product,country date, country
product, date, country
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D(base) cuboid
Cube: A Lattice of Cuboids(Various Dimension Combinations)
Cube
二〇二三年五月三日 Introduction to D/W 43
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures Star schema: A fact table in the middle connected
to a set of dimension tables Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
二〇二三年五月三日 Introduction to D/W 44
Example of Star Schema (3D3M cube)
time_key(1010111)dayday_of_the_weekmonthQuarter(1Qtr)year
time
location_key(511)Street (1 main Ave.)city (New York) State (NY)Country (USA)
location
Sales Fact Table
item_key(01)
time_key(1010111) location_key(511)
units_sold(4)
dollars_sold(20)
avg_sales(5)
item_key(01)item_namebrandType (TV)supplier_typesupplier_addr
item
DMQLDefineMeasur
es
time Dimension Table
item Dimension Table
location Dimension Table
Cube
二〇二三年五月三日 Introduction to D/W 45
Example of Snowflake Schema (4D3M cube)
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcity_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
item_keyitem_namebrandtypesupplier_key
item
branch_keybranch_namebranch_type
branch
supplier_keysupplier_typesupplier_addr
supplier
city_keycitystatecountry
city
DMQLMeasur
es
二〇二三年五月三日 Introduction to D/W 46
Example of Fact Constellation (4D3M cube)
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_salesMeasures
item_keyitem_namebrandtypesupplier_typeSupplier_addr
item
branch_keybranch_namebranch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_keyshipper_namelocation_keyshipper_type
shipper
DMQL
二〇二三年五月三日 Introduction to D/W 47
DMQL: Language Primitives(Data Mining Query Language)
Cube Definition ( Fact Table )define cube <cube_name> [<dimension_list>]:
<measure_list> Dimension Definition ( Dimension Table )
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables) First time as “cube definition” Second time :
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
二〇二三年五月三日 Introduction to D/W 48
Defining a Star Schema in DMQL
define cube sales [time, item, location]: units_sold = count(*), dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type, address)
define dimension location as (location_key, street, city, state, country)
二〇二三年五月三日 Introduction to D/W 49
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:units_sold = count(*), dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type, supplier_addr))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, state, country))
sales data cube
shipping data cube
二〇二三年五月三日 Introduction to D/W 50
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:units_sold = count(*), dollars_sold = sum(sales_in_dollars),avg_sales = avg(sales_in_dollars)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type,supplier_addr)
define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, state, country)
define cube shipping [time, item, shipper, from_location, to_location]:dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales
二〇二三年五月三日 Introduction to D/W 51
A Concept Hierarchy in Location Dimension
all
Europe North_America
MexicoCanadaSpainGermany
Vancouver
M. WindL. Chan
...
......
... ...
...
all
region
office
country
TorontoFrankfurtcity
define dimension location as (location_key, office, city, country, region)
二〇二三年五月三日 Introduction to D/W 52
View of Warehouses and Hierarchies
Region Country branch_name rep_nameNorth American USA New York Helen GibbsonNorth American Canada Vancouver Hari KrainNorth American USA L.A. John SmithNorth American USA Boston Northern AreaNorth American Canada Toronto Tom SchmidtNorth American USA Boston Deb GardnerNorth American USA New York Helen GibbsonNorth American USA L.A. John SmithNorth American Mexico Mexico City Tim EmpireNorth American Canada Toronto Tom SchmidtNorth American USA New York Helen Gibbson
二〇二三年五月三日 Introduction to D/W 53
Levels of Multidimensional Data Sales volume as a function of product, time, and
location
Prod
uct
Region
Month
Dimensions: Product, Time , Location
Industry Region Year
Category Country Quarter
Product City Month Week
Office Day
Hierarchical summarization levels
Location TimeProduct
Current view level: Product, Month , Region
二〇二三年五月三日 Introduction to D/W 54
A Sample Data Cube of SalesTotal annual salesof TV in U.S.A.Time
Item
Loc
atio
nsum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum
Define OLAP
Total product salesof 1Qtr in U.S.A.
Report
二〇二三年五月三日 Introduction to D/W 55
Browsing a Data Cube(Various Ways : 1D1M & 2D1M)
020406080
100
1Qtr 2Qtr 3Qtr 4Qtr
TVPCVCR
PC Sales in USA
020406080
1Qtr 2Qtr 3Qtr 4Qtr
1Qtr/All Products
0
1020
30
USA Canada Mexico
1Qtr in USA
010203040
TV PC VCR
1
2
3
4
User ?
User ?
User ?
User ?
二〇二三年五月三日 Introduction to D/W 56
Browsing a Data Cube for Various Views(2D1M’s Presentation for 7 Views)
1Qtr 2Qtr 3Qtr 4Qtr USA
Cana
daM
exico
020406080
100
12 5
43
76
二〇二三年五月三日 Introduction to D/W 57
Browsing a Data Cube(3D2M 的呈現 )
二〇二三年五月三日 Introduction to D/W 58
Typical OLAP Operations Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): Re-orient the cube, visualization, 3D to series of 2D planes.
Other operations drill across: involving (across) more than one fact table (cube) drill through: through the bottom level of the cube to its back-
end relational tables
Cube
Levels
二〇二三年五月三日 Introduction to D/W 59
Typical OLAP Operations
Single Cell Multiple Cells Slice Dice
Roll Up
Drill Down
二〇二三年五月三日 Introduction to D/W 60
A Star-Net Query Model(8D, 1X2X2X3X3X2X3X3=648 combinations of different levels)
Shipping Method
AIR-EXPRESS
TRUCKORDER
Customer Orders
CONTRACTS
Customer
Product
PRODUCT GROUP
PRODUCT LINE
PRODUCT ITEM
SALES PERSON
DISTRICT
DIVISION
OrganizationPromotion
CITY
COUNTRY
REGION
Location
DAILYQTRLYANNUALYTime
Each circle is called a footprint
二〇二三年五月三日 Introduction to D/W 61
Multi-Tiered D/W Architecture
DataWarehouse
ExtractTransformLoadRefresh
OLAP Engine
AnalysisQueryReportsData mining
Monitor&
IntegratorMetadata
Data Sources Front-End Tools
Serve
Data Marts
Operational DBs
othersources
Data Storage
OLAP Server
Tool 3Tool 2
Tool 1
二〇二三年五月三日 Introduction to D/W 62
Three Data Warehouse Models
Enterprise warehouse collects all of the information about subjects spanning the
entire organization Data Mart
a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart
Virtual warehouse A set of views over operational databases Only some of the possible summary views may be materialized
二〇二三年五月三日 Introduction to D/W 63
Data Warehouse Development: A Recommended Approach
Define a high-level corporate data model
Data Mart
Data Mart
Distributed Data Marts
Multi-Tier Data Warehouse
Enterprise Data Warehouse
Model refinementModel refinement
二〇二三年五月三日 Introduction to D/W 64
OLAP Server Architectures Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces
Include optimization for each DBMS backend, implementation of aggregation navigation logic, and additional tools and services
greater scalability Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse matrix techniques)
fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high-level: array Specialized SQL servers
specialized support for SQL queries over star/snowflake schemas
二〇二三年五月三日 Introduction to D/W 65
Metadata Repository Meta data defines warehouse objects. It has the following kinds :
Description of the data warehouse structure schema, view, dimensions, hierarchies, derived data definition, data
mart locations and contents Operational meta-data
data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization The mapping from operational environment to the data
warehouse Data related to system performance
warehouse schema, view and derived data definitions Business data
ownership of data, charging policies, business terms and definitions
二〇二三年五月三日 Introduction to D/W 66
Data Warehouse Back-End Tools and Utilities(ETL Tools)
Data extraction get data from multiple, heterogeneous, and external sources
Data cleaning detect errors in the data and rectify them when possible
Data transformation convert data from legacy or host format to warehouse format
Data loading sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions Data refresh
propagate the updates from the data sources to the warehouse
二〇二三年五月三日 Introduction to D/W 67
Data Warehouse Usage
Three kinds of data warehouse applications Information processing
supports querying, basic statistical analysis, and reporting using tables, charts and graphs
Analytical OLAP processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice/dice, drilling,
pivoting Data mining
knowledge discovery from hidden patterns supports associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization tools.
二〇二三年五月三日 Introduction to D/W 68
Summary
Data warehouse A subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data Warehouse Back-End Tools: ETL tools
二〇二三年五月三日 Introduction to D/W 69
Thanks !!!!