Spring Term 2020 Slide 1
Data Warehousing
Analytic Applications and
Business Intelligence
Spring Term 2020, Dr. Andreas Geppert
© Andreas Geppert Spring Term 2020 Slide 2
Outline of the Course
• Introduction
• DWH Architecture
• DWH-Design and multi-dimensional data models
• Extract, Transform, Load (ETL)
• Metadata
• Data Quality
• Analytic Applications and Business Intelligence
• Implementation and Performance
© Andreas Geppert Spring Term 2020 Slide 3
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
9. (Data Mining)
© Andreas Geppert Spring Term 2020 Slide 4
Layered Architecture
[Figure: Credit Suisse DWH Reference Architecture V5. Layers, from sources to users: Data Sources; Staging Area; Domain Integration and Enrichment (integration, aggregation, calculation); Data Marts (selection, aggregation, calculation); Reporting and Analysis Services (reporting, OLAP, data mining); Front End (GUI). Further components: Federated Integration (no ETL), Reference/Master Data, (Meta)data Management. Legend: data flow; relational database; multidimensional database; file; logic; extract, transform, load; integration; enrichment.]
© Andreas Geppert Spring Term 2020 Slide 5
Analytic Systems
• Concept-oriented Systems
– Balanced Scorecard
– Planning and Budgeting
– Consolidation
– Value-oriented Management
• Generic Systems
– Ad-hoc Analysis Systems: Free OLAP Analysis, Guided OLAP Analysis
– Free Data Retrieval: SQL, MDX
– Reporting Systems: Interactive Reporting Platforms, Generated Reports
– Model-based Analysis Systems: Decision Support Systems, Expert Systems, Data Mining
© Andreas Geppert Spring Term 2020 Slide 6
BI-Tools and (IT-) Skills
[Figure: BI tools by user group, arranged by degree of specialization. User groups: IT Developers; Analysts & Information Workers; Executives & Managers; Front-Line Workers; Customers, Suppliers, Regulators. Tools shown: production reporting tools, statistics, BI spreadsheets, OLAP, business query, dashboards, interactive fixed reports, scorecards, embedded BI, BI search, published reports.]
Source: Howson 2008
© Andreas Geppert Spring Term 2020 Slide 7
Decisions: Frequency and Economic Impact
• High-impact, infrequent decisions
– Ex: M&A, capital investment, strategic market positioning
• Medium-impact, medium-frequency decisions
– Ex: product development and pricing, customer segmentation
• Low-impact, frequent decisions
– Ex: loan request, cross-sell offers, customer upgrade
[Figure axes: Frequency of Decision (x), Economic Impact of Individual Decisions (y)]
Source: Howson 2008
© Andreas Geppert Spring Term 2020 Slide 8
BI System Architecture
[Figure: BI system architecture. Components: BI server (caching, optimization, security, workflow); data sources (DWH, cube, ERP, CRM, OLTP, spreadsheet); semantic layer; front ends (tools, spreadsheet).]
© Andreas Geppert Spring Term 2020 Slide 9
Semantic Models
• Business users, especially in ad-hoc reporting, cannot be expected to have the skills to access relational data models via SQL
• Semantic models (semantic layers) form an intermediate layer between database/DWH and users:
– Semantic models are closer to business language and terminology than relational models and star schemas
– Semantic models abstract from database structures: joins, aggregations etc. can be hidden in the mapping of the semantic layer onto database structures
– Ideally, all the required BI tools integrate with the semantic layer
• Examples: Business Objects Universes, OBIEE
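The idea of hiding joins and aggregations behind business terms can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary layout and the function build_query are hypothetical and do not reflect the metadata format of any real product; the table and column names follow the lecture's example star schema.

```python
# Hypothetical semantic model: business terms mapped onto a star schema.
# All names in this dictionary are illustrative assumptions.
SEMANTIC_MODEL = {
    "fact": "sales_fact f",
    "measures": {"Sales": "sum(f.unit_sales)"},
    "dimensions": {
        "Product": ("p.product_name",
                    "join product p on f.product_id = p.product_id"),
        "Store": ("s.store_name",
                  "join store s on f.store_id = s.store_id"),
    },
}


def build_query(measure, dimensions):
    """Translate business terms into SQL; joins and aggregation stay hidden."""
    columns = [SEMANTIC_MODEL["dimensions"][d][0] for d in dimensions]
    joins = [SEMANTIC_MODEL["dimensions"][d][1] for d in dimensions]
    aggregate = SEMANTIC_MODEL["measures"][measure]
    parts = ["select " + ", ".join(columns + [aggregate + " as total"]),
             "from " + SEMANTIC_MODEL["fact"]]
    parts.extend(joins)
    if columns:
        parts.append("group by " + ", ".join(columns))
    return "\n".join(parts)


# The user only names "Sales", "Product" and "Store".
semantic_sql = build_query("Sales", ["Product", "Store"])
print(semantic_sql)
```

The user-facing vocabulary lives entirely in the dictionary, so renaming a table only requires changing the mapping, not the reports built on top of it.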
© Andreas Geppert Spring Term 2020 Slide 10
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
– Star Joins
– Super groups
– Aggregate and analysis functions
– Local grouping
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 11
Example Schema
[Figure: diagram of the example schema]
© Andreas Geppert Spring Term 2020 Slide 12
Query Patterns: Star Queries
• Queries against star schemas
• Join the fact table with some or all of the dimension tables
→ Star queries, star join
• Typically with restrictions on the dimension tables
• Typically also with grouping and aggregation of the resulting table
© Andreas Geppert Spring Term 2020 Slide 13
Example: Star Queries
• Sales per product, store, and day
select p.product_name, s.store_name, t.the_date,
       sum(f.unit_sales) as sales
from sales_fact f, store s, product p, time_by_day t
where f.product_id = p.product_id
  and f.store_id = s.store_id
  and f.time_id = t.time_id
group by p.product_name, s.store_name, t.the_date;
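To try the star query without a warehouse, the following sketch runs it against an in-memory SQLite database using Python's sqlite3 module. The table and column names follow the query above; the sample rows and the explicit join syntax are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table product     (product_id integer primary key, product_name text);
    create table store       (store_id   integer primary key, store_name   text);
    create table time_by_day (time_id    integer primary key, the_date     text);
    create table sales_fact  (product_id integer, store_id integer,
                              time_id integer, unit_sales real);
    -- invented sample data
    insert into product     values (1, 'Cornetto'), (2, 'Magnum');
    insert into store       values (1, 'Kiosk A');
    insert into time_by_day values (1, '2006-06-06'), (2, '2006-07-07');
    insert into sales_fact  values (1, 1, 1, 10), (1, 1, 1, 5),
                                   (2, 1, 1, 25), (1, 1, 2, 22);
""")

# Star join: the fact table is joined with each dimension table,
# then grouped by the dimension attributes and aggregated.
star_query = """
    select p.product_name, s.store_name, t.the_date,
           sum(f.unit_sales) as sales
    from sales_fact f
         join product p     on f.product_id = p.product_id
         join store s       on f.store_id   = s.store_id
         join time_by_day t on f.time_id    = t.time_id
    group by p.product_name, s.store_name, t.the_date
    order by p.product_name, t.the_date
"""
star_rows = conn.execute(star_query).fetchall()
for row in star_rows:
    print(row)
```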
© Andreas Geppert Spring Term 2020 Slide 14
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
– Star Joins
– Super groups
– Aggregate and analysis functions
– Local grouping
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 15
Grouping in SQL
• Traditionally:
– group-by clause
– Per query, there is a fixed set of grouping criteria
• Suboptimal for flexible grouping
– Along multiple dimensions
– On multiple levels of a dimension hierarchy
• All combinations of grouping attributes over G1, ..., Gn?
– 2^n queries with corresponding grouping criteria
– Example: (product, store, date) requires the 8 groupings
[(product, store, date), (store, date), (product, date), (product, store),
(product), (store), (date), ()]
→ Super groups
– Grouping sets, rollup, cube
© Andreas Geppert Spring Term 2020 Slide 16
Grouping Sets
• Groups along multiple grouping criteria
• In a single query!
→ Grouping sets
– Explicit listing of all grouping criteria
• Example: sum of sales, per
– product,
– day,
– product and day
in one and the same query
© Andreas Geppert Spring Term 2020 Slide 17
Grouping Sets: Sample Data
PRODUCT   SALES_DATE SALES_CNT
--------- ---------- ---------
Cornetto  06.06.06          15
Magnum    06.06.06          25
(null)    06.06.06           5
Cornetto  07.07.06          22
Magnum    07.07.06          33
(null)    07.07.06           6
Cornetto  (null)            44
Magnum    (null)             9
© Andreas Geppert Spring Term 2020 Slide 18
Grouping Sets: Example
select product, sales_date, sum(sales_cnt) total
from kiosk
group by grouping sets ( (product),
(sales_date),
(product, sales_date));
© Andreas Geppert Spring Term 2020 Slide 19
Grouping Sets: Example (2)
• Possible result:
PRODUCT  SALES_DATE TOTAL
-------- ---------- -----
Magnum   (null)         9
Cornetto (null)        44
(null)   06.06.06       5
Magnum   06.06.06      25
Cornetto 06.06.06      15
(null)   07.07.06       6
Magnum   07.07.06      33
Cornetto 07.07.06      22
(null)   (null)        53
(null)   06.06.06      45
(null)   07.07.06      61
(null)   (null)        11
Magnum   (null)        67
Cornetto (null)        81
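The result above can be reproduced with a small sketch. SQLite does not support grouping sets, so the sketch emulates them with one group-by query per grouping criterion combined by union all; this emulation is an assumption made so the example runs with Python's sqlite3 module, not the way a DBMS with native support would evaluate the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table kiosk (product text, sales_date text, sales_cnt integer)")
# Sample data from the slide; None stands for NULL in the data.
conn.executemany("insert into kiosk values (?, ?, ?)", [
    ("Cornetto", "06.06.06", 15), ("Magnum", "06.06.06", 25), (None, "06.06.06", 5),
    ("Cornetto", "07.07.06", 22), ("Magnum", "07.07.06", 33), (None, "07.07.06", 6),
    ("Cornetto", None, 44), ("Magnum", None, 9),
])

# One branch per grouping set: (product), (sales_date), (product, sales_date).
emulated_grouping_sets = """
    select product, null as sales_date, sum(sales_cnt) as total
    from kiosk group by product
    union all
    select null, sales_date, sum(sales_cnt)
    from kiosk group by sales_date
    union all
    select product, sales_date, sum(sales_cnt)
    from kiosk group by product, sales_date
"""
grouping_sets_rows = conn.execute(emulated_grouping_sets).fetchall()
for row in grouping_sets_rows:
    print(row)
```

Note that the rows (Cornetto, NULL, 81) and (Cornetto, NULL, 44) look alike although the first NULL means "all dates" and the second is a NULL from the data.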
© Andreas Geppert Spring Term 2020 Slide 20
Grouping Function
• SQL queries with super groups generate null values
– Meaning: "all"
– A set of values cannot be represented in first normal form
• Additionally, there may be null values in the data
– Meaning: not existing, unknown
→ In the query result, the meaning of NULL (-) is not obvious
→ Grouping function
– Shows whether the row is the result of grouping over the column
– 0: a NULL in the column comes from the data
– 1: NULL because of grouping ("all")
© Andreas Geppert Spring Term 2020 Slide 21
The Grouping Function: Example
• Similar query as above
select product, grouping(product) as prodgrp,
sales_date, grouping(sales_date) dategrp,
sum(sales_cnt) as total
from kiosk
group by grouping sets ( (product),
(sales_date),
(product, sales_date))
© Andreas Geppert Spring Term 2020 Slide 22
The Grouping Function: Example (2)
• Result:
product  prodgrp sales_date dategrp total
-------- ------- ---------- ------- -----
Cornetto       0 06.06.06         0    15
Cornetto       0 07.07.06         0    22
Cornetto       0 (null)           0    44
Cornetto       0 (null)           1    81
Magnum         0 06.06.06         0    25
Magnum         0 07.07.06         0    33
Magnum         0 (null)           0     9
Magnum         0 (null)           1    67
(null)         0 06.06.06         0     5
(null)         1 06.06.06         0    45
(null)         1 07.07.06         0    61
(null)         0 07.07.06         0     6
(null)         1 (null)           0    53
(null)         0 (null)           1    11
© Andreas Geppert Spring Term 2020 Slide 23
The Grouping Function: Example (3)
• Query as above, with declaration of the «all» values:
select decode(grouping(product), 1, 'All Products', product),
       decode(grouping(sales_date), 1, 'All Dates', sales_date),
       sum(sales_cnt) as total
from kiosk
group by grouping sets ( (product),
                         (sales_date),
                         (product, sales_date));
© Andreas Geppert Spring Term 2020 Slide 24
The Grouping Function: Example (4)
• Result:
product      date      total
------------ --------- -----
All Products 06.06.06     45
All Products 07.07.06     61
All Products (null)       53
Cornetto     06.06.06     15
Cornetto     07.07.06     22
Cornetto     (null)       44
Cornetto     All Dates    81
Magnum       06.06.06     25
Magnum       07.07.06     33
Magnum       (null)        9
Magnum       All Dates    67
(null)       06.06.06      5
(null)       07.07.06      6
(null)       All Dates    11
© Andreas Geppert Spring Term 2020 Slide 25
The Cube Operator
• Grouping with all possible combinations?
• G1, ..., Gn → 2^n grouping criteria with grouping sets
→ Abbreviation: the cube operator
• cube(G1, ..., Gn) ≡ grouping sets over all 2^n subsets of {G1, ..., Gn}
• Example: cube(A, B) ≡ grouping sets((A, B), (A), (B), ())
• ( ): grand total
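The expansion of cube into grouping sets is simply the power set of the grouping attributes. A minimal Python sketch, where the helper name cube is chosen here for illustration only:

```python
from itertools import combinations


def cube(*columns):
    """Return the grouping sets of cube(columns): all subsets, largest first."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]


cube_sets = cube("A", "B")
print(cube_sets)
# Three attributes already yield 2**3 = 8 grouping sets.
print(len(cube("product", "store", "date")))
```

The empty tuple stands for the grand total.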
© Andreas Geppert Spring Term 2020 Slide 26
The Cube Operator: Example
• Sales grouped by:
– Product family
– Customer country
– Product family and customer country
– And overall sum (grand total)
select product_family, country, sum(store_sales)
from sales_fact, product, product_class pc, customer c
where ...
group by cube(pc.product_family, c.country);
© Andreas Geppert Spring Term 2020 Slide 27
The Cube Operator: Example (2)
PRODUCT_FAMILY COUNTRY sales
-------------- ------- ------------
Drink Canada 14256.53
Drink Mexico 57991.18
Drink USA 150349.99
Drink - 222597.70
Food Canada 124649.63
Food Mexico 568431.52
Food USA 1376761.54
Food - 2069842.69
Non-Consumable Canada 30845.95
Non-Consumable Mexico 137558.68
Non-Consumable USA 329025.81
Non-Consumable - 497430.44
- Canada 169752.11
- Mexico 763981.38
- USA 1856137.34
- - 2789870.83
© Andreas Geppert Spring Term 2020 Slide 28
The Rollup Operator
• Often we are not interested in all possible grouping criteria
• But mainly in all aggregates along a (subset of a) dimension hierarchy
• That is, we would like to see a stepwise rollup
→ Rollup operator
• Computes n grouping combinations + grand total
• rollup(Family, Department, Product) ≡
grouping sets((Family, Department, Product),
(Family, Department), (Family), ())
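In contrast to cube, rollup only produces the prefixes of the attribute list. A minimal Python sketch of this expansion, with the helper name rollup chosen here for illustration:

```python
def rollup(*columns):
    """Return the grouping sets of rollup(columns): every prefix, longest first."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


rollup_sets = rollup("Family", "Department", "Product")
for grouping_set in rollup_sets:
    print(grouping_set)
```

For n attributes this gives n + 1 grouping sets, compared with 2^n for cube.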
© Andreas Geppert Spring Term 2020 Slide 29
The Rollup Operator: Example
• Sum of sales per
– Product family and product department
– Product family
– Overall (grand total)
select product_family, product_department, sum(store_sales)
from sales_fact f, product p, product_class pc
where ...
group by rollup(pc.product_family,
                pc.product_department)
© Andreas Geppert Spring Term 2020 Slide 30
The Rollup Operator: Example (2)
PRODUCT_FAMILY PRODUCT_DEPARTMENT SALES
-------------- ------------------ ----------
Drink Alcoholic Beverages 68118.28
Drink Beverages 123181.16
Drink Dairy 31298.26
Drink - 222597.70
Food Baked Goods 107589.71
Food Baking Goods 185836.08
Food Breakfast Foods 36523.92
Food Canned Foods 183554.39
Food Dairy 152413.99
Food Eggs 43091.95
Food Frozen Foods 287099.75
... ... ...
Food - 2069842.69
Non-Consumable Health and Hygiene 144139.47
Non-Consumable Household 283380.38
Non-Consumable Periodicals 40860.55
Non-Consumable - 497430.44
- - 2789870.83
© Andreas Geppert Spring Term 2020 Slide 31
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
– Star Joins
– Super groups
– Aggregate and analysis functions
– Local grouping
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 32
New Aggregation Functions in SQL
• Traditional aggregate functions
– sum, count, min, max, avg
– Operate on entire columns or on partitions resulting from a traditional group-by clause ("global grouping")
• This is often not sufficient for analytic queries
– For instance, aggregates (or, more generally, calculated values) should be computed based on a subset of the result set relative to single tuples (rows)
• "Local" grouping
– Other terms: window functions, OLAP functions, new aggregate functions
• Examples:
– Ranking: top seller
– Numbering of rows
– Position of elements in a list
© Andreas Geppert Spring Term 2020 Slide 33
Rank Functions
• Rank functions
– Assign a numeric value to each individual row
– Rank = position of the element in a sorted list
– Note that the position of a tuple in a list obtained by an order-by clause is only implicit!
• Rank operator
– Sort criterion is specified in a (new) order clause
– If two or more elements tie, the following ranks are not assigned
• dense_rank: no gaps in ranks
• Numbering function:
– Enumerates tuples
– Never assigns equal numbers to tuples
– row_number
© Andreas Geppert Spring Term 2020 Slide 34
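Rank Functions: Examples
Columns: rank = rank() over(order by c), dense = dense_rank() over(order by c), rownum = row_number() over(order by c)
A  B  C  rank dense rownum
-- -- -- ---- ----- ------
a1 b1 9     7     6      7
a2 b2 8     5     5      5
a3 b1 8     5     5      6
a4 b2 6     4     4      4
a5 b1 5     3     3      3
a6 b2 4     2     2      2
a7 b1 3     1     1      1
(The tied rows a2 and a3 may receive the row numbers 5 and 6 in either order.)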
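The three functions can be compared directly on the sample table. The sketch below uses Python's sqlite3 module and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The extra sort key a in row_number is an addition made here so that the numbering of the tied rows is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c integer)")
conn.executemany("insert into t values (?, ?, ?)", [
    ("a1", "b1", 9), ("a2", "b2", 8), ("a3", "b1", 8), ("a4", "b2", 6),
    ("a5", "b1", 5), ("a6", "b2", 4), ("a7", "b1", 3),
])

rank_rows = conn.execute("""
    select a, c,
           rank()       over (order by c)    as rnk,        -- gaps after ties
           dense_rank() over (order by c)    as dense_rnk,  -- no gaps
           row_number() over (order by c, a) as row_num     -- always unique
    from t
    order by a
""").fetchall()
for row in rank_rows:
    print(row)
```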
© Andreas Geppert Spring Term 2020 Slide 35
Rank Functions: Examples
• The list of products, ordered by total sales, including the position in the sorted list
select p.product_name,
f.storeSales,
rank() over(order by f.storeSales desc)
as salesRank
from productSalesV f join
product p on f.product_id = p.product_id
order by salesRank asc;
© Andreas Geppert Spring Term 2020 Slide 36
Rank Functions: Examples (2)
PRODUCT_NAME STORESALES SALESRANK
------------------------------- --------------- ---------
Carrington Turkey TV Dinner 11753.84 1
Big Time Apple Cinnamon Waffles 11585.34 2
CDR Vegetable Oil 11493.60 3
High Quality 60 Watt Lightbulb 11408.76 4
Ebony Lettuce 11371.86 5
...
Super Columbian Coffee 245.00 1558
Top Measure Chardonnay Wine 240.12 1559
© Andreas Geppert Spring Term 2020 Slide 37
Rank Functions: Examples (3)
• The ten best-selling products
select *
from table(select p.product_name,
f.storeSales,
rank() over(order by f.storeSales desc)
as salesRank
from productSalesV ...) sr
where salesRank <= 10
order by salesRank;
© Andreas Geppert Spring Term 2020 Slide 38
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
– Star Joins
– Super groups
– Aggregate and analysis functions
– Local grouping
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 39
Local Grouping
• Calculate aggregates based on partitions given by the context of single tuples
→ «Local» grouping
– Each individual tuple defines an aggregate (possibly together with other tuples)
• Comparisons to other aggregates are possible
– For instance, change in monthly sales compared to the previous year
• Moving average
– For instance, product sales averaged over previous, current, and following month
• Cumulated sums
– For instance, monthly sums added up to the current month (year-to-date)
© Andreas Geppert Spring Term 2020 Slide 40
Local Grouping (2)
• Hard to formulate (if possible at all) with traditional SQL
– Possibly multiple SQL statements are required
→ Local grouping
– One output tuple per input tuple
– Local partitioning criterion
– Local ordering criterion
– Window size
© Andreas Geppert Spring Term 2020 Slide 41
Local Partitioning
• Partition is defined based on the «current» tuple
• Aggregates will be calculated over this partition
• partition-by clause
• Typically used for pre-aggregated data
• Often used to compute ratios (ratio-to-report)
© Andreas Geppert Spring Term 2020 Slide 42
Local Partitioning (2)
Columns: part_sum = sum(c) over(partition by b), run_sum = sum(c) over(partition by b order by a)
A  B  C part_sum run_sum
-- -- - -------- -------
a1 b1 6       15       6
a2 b2 5       11       5
a3 b1 5       15      11
a4 b2 4       11       9
a5 b1 3       15      14
a6 b2 2       11      11
a7 b1 1       15      15
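The effect of adding an order-by to the partition can be checked with the sample table. The sketch uses Python's sqlite3 module and assumes SQLite 3.25 or later for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c integer)")
conn.executemany("insert into t values (?, ?, ?)", [
    ("a1", "b1", 6), ("a2", "b2", 5), ("a3", "b1", 5), ("a4", "b2", 4),
    ("a5", "b1", 3), ("a6", "b2", 2), ("a7", "b1", 1),
])

partition_rows = conn.execute("""
    select a, b, c,
           -- one sum per partition, repeated on every row of the partition
           sum(c) over (partition by b)            as part_sum,
           -- with an ordering, the sum runs up to the current row
           sum(c) over (partition by b order by a) as run_sum
    from t
    order by a
""").fetchall()
for row in partition_rows:
    print(row)
```

Every input row yields exactly one output row, which is the difference to a traditional group-by.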
© Andreas Geppert Spring Term 2020 Slide 43
Local Partitioning: Example
• Sum of sales per day plus ratio to monthly sales (percent) and yearly sales (per mille)
select theDate, storeSales,
sum(storeSales) over(partition by month(theDate))
as mSales,
100 * storeSales /
sum(storeSales) over(partition by month(theDate))
as monPcnt,
sum(storeSales) over(partition by year(theDate))
as ySales,
1000 * storeSales /
sum(storeSales) over(partition by year(theDate))
as yearPmil
from timeSalesV;
© Andreas Geppert Spring Term 2020 Slide 44
Local Partitioning: Example (2)
thedate storesales msales monpcnt ysales yearpmil
---------- ---------- ---------- ------- ----------- -----
01/01/2003 1139.06 228289.43 0.00 890572.78 1.00
...
01/05/2003 2877.56 228289.43 1.00 890572.78 3.00
...
02/15/2003 4939.80 217254.31 2.00 890572.78 5.00
02/16/2003 3195.47 217254.31 1.00 890572.78 3.00
...
01/02/2004 6379.30 228289.43 2.00 1899298.05 3.00
...
02/01/2004 6706.19 217254.31 3.00 1899298.05 3.00
...
© Andreas Geppert Spring Term 2020 Slide 45
Local Grouping: Calculation of Ratios
• Calculation of ratios using ratio_to_report
• Determines the relative share that a value contributes to a sum
select …
       ratio_to_report(sales)
         over (partition by month(theDate)) * 100
         as monPcnt
…
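SQLite has no ratio_to_report function, so the following sketch emulates it by dividing each value by the windowed sum of its partition. The sample table and the emulation are assumptions made for illustration; SQLite 3.25 or later is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c integer)")
conn.executemany("insert into t values (?, ?, ?)", [
    ("a1", "b1", 6), ("a2", "b2", 5), ("a3", "b1", 5), ("a4", "b2", 4),
    ("a5", "b1", 3), ("a6", "b2", 2), ("a7", "b1", 1),
])

# Emulation of ratio_to_report(c) over (partition by b) * 100
ratio_rows = conn.execute("""
    select a, b, c,
           100.0 * c / sum(c) over (partition by b) as pct
    from t
    order by a
""").fetchall()

share = {a: pct for a, _b, _c, pct in ratio_rows}
for a, pct in share.items():
    print(a, round(pct, 1))
```

Within each partition the shares add up to 100 percent.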
© Andreas Geppert Spring Term 2020 Slide 46
Windows
• Windows can be moved over an (intermediate) table
• Aggregates will be computed over the data «visible through the window»
• Window (size) is defined in terms of the current tuple T
→ Window clause
• Options:
– Position based:
n (or all, none) tuples before T (in the specified sort order)
n (or all, none) tuples after T (in the specified sort order)
– Value based
© Andreas Geppert Spring Term 2020 Slide 47
Windows (2)
A  B  C
-- -- -
a1 b1 6
a2 b2 5
a3 b1 5
a4 b2 4
a5 b1 3
a6 b2 2
a7 b1 1
partition by b order by a
rows between 1 preceding and 1 following
© Andreas Geppert Spring Term 2020 Slide 48
Moving Average: Example
• Three-month average of the sum of sales
select monat, jahr, storeSales,
avg(storeSales)
over(partition by jahr
order by monat
rows between 1 preceding and 1 following)
as avg_3_mon
from monthSalesV
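The moving average can be tried out with Python's sqlite3 module (SQLite 3.25 or later assumed). In this sketch monthSalesV is created as a plain table rather than a view, and it holds only a few of the monthly figures that appear in the lecture's result tables; both choices are simplifications made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table monthSalesV (monat integer, jahr integer, storeSales real)")
conn.executemany("insert into monthSalesV values (?, ?, ?)", [
    (1, 2003, 70923.85), (2, 2003, 71214.04), (3, 2003, 80282.83),
    (4, 2003, 66043.94), (5, 2003, 70556.32),
    (1, 2004, 157365.58), (2, 2004, 146040.27),
])

moving_avg_rows = conn.execute("""
    select jahr, monat,
           avg(storeSales)
             over (partition by jahr
                   order by monat
                   rows between 1 preceding and 1 following) as avg_3_mon
    from monthSalesV
    order by jahr, monat
""").fetchall()

moving_avg = {(jahr, monat): avg for jahr, monat, avg in moving_avg_rows}
for key, avg in moving_avg.items():
    print(key, round(avg, 2))
```

At the partition borders the window is truncated: the value for January averages only January and February, because no row of the same year precedes it.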
© Andreas Geppert Spring Term 2020 Slide 49
Moving Average: Example (2)
MONAT JAHR STORESALES AVG_3_MON
----------- ----------- ---------- -----------
1 2003 70923.85 71068.94
2 2003 71214.04 74140.24
3 2003 80282.83 72513.60
4 2003 66043.94 72294.36
...
9 2003 69036.89 68672.30
10 2003 64944.22 72704.42
11 2003 84132.16 79524.97
12 2003 89498.53 86815.34
1 2004 157365.58 151702.92
2 2004 146040.27 153100.34
3 2004 155895.17 150537.91
...
© Andreas Geppert Spring Term 2020 Slide 50
Cumulated Sums
• Individual tuples contribute successively to the computation of a sum
• Sum in step 1 (per partition!) = attribute value of the first tuple
• Sum in step n+1 (per partition!) = sum of step n + attribute value of the (n+1)st tuple
• Result tuples represent cumulated sums
© Andreas Geppert Spring Term 2020 Slide 51
Cumulated Sums: Example
• Monthly sales and sum of sales up to and including the current month
select monat, jahr, storeSales,
sum(storeSales) over(partition by jahr
order by monat
rows unbounded preceding) cum_Sales
from monthSalesV;
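A runnable sketch of the cumulated sum, again with Python's sqlite3 module (SQLite 3.25 or later assumed). monthSalesV is modelled as a plain table holding a subset of the monthly figures from the lecture's result tables; this is a simplification for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table monthSalesV (monat integer, jahr integer, storeSales real)")
conn.executemany("insert into monthSalesV values (?, ?, ?)", [
    (1, 2003, 70923.85), (2, 2003, 71214.04), (3, 2003, 80282.83),
    (4, 2003, 66043.94), (5, 2003, 70556.32),
    (1, 2004, 157365.58), (2, 2004, 146040.27),
])

cum_rows = conn.execute("""
    select jahr, monat,
           sum(storeSales) over (partition by jahr
                                 order by monat
                                 rows unbounded preceding) as cum_sales
    from monthSalesV
    order by jahr, monat
""").fetchall()

cum_sales = {(jahr, monat): total for jahr, monat, total in cum_rows}
for key, total in cum_sales.items():
    print(key, round(total, 2))
```

Because of the partition by jahr, the running sum restarts with the first month of each year.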
© Andreas Geppert Spring Term 2020 Slide 52
Cumulated Sums: Example (2)
MONAT JAHR STORESALES CUM_SALES
----------- ----------- ---------- -------------
1 2003 70923.85 70923.85
2 2003 71214.04 142137.89
3 2003 80282.83 222420.72
4 2003 66043.94 288464.66
5 2003 70556.32 359020.98
...
11 2003 84132.16 801074.25
12 2003 89498.53 890572.78
1 2004 157365.58 157365.58
2 2004 146040.27 303405.85
3 2004 155895.17 459301.02
4 2004 149678.31 608979.33
...
© Andreas Geppert Spring Term 2020 Slide 53
Processing of Analytic Queries
• Data filtering (WHERE)
• (Global) grouping (GROUP BY)
• Filtering of aggregates (HAVING)
• Computation of analytic functions
– Each analytic function is computed separately
– Creation of partitions
– Sorting of partitions
– Application of ranking or aggregate functions
• Sorting of final result (ORDER BY)
• Note that WHERE and HAVING are applied before the computation of analytic functions
→ Results of analytic functions cannot be referred to in these clauses
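The consequence of this processing order can be observed in practice. The sketch below uses Python's sqlite3 module (SQLite 3.25 or later assumed) and an invented sample table: a window function in the WHERE clause is rejected, while filtering in an outer query over a subquery works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, c integer)")
conn.executemany("insert into t values (?, ?)", [
    ("a1", 9), ("a2", 8), ("a3", 8), ("a4", 6), ("a5", 5),
])

# A window function inside WHERE is rejected: WHERE is evaluated
# before the analytic functions are computed.
try:
    conn.execute("select a from t where rank() over (order by c desc) <= 2")
    where_rejected = False
except sqlite3.OperationalError as error:
    where_rejected = True
    print("rejected:", error)

# Workaround: compute the rank in a subquery, filter in the outer query.
top_two = conn.execute("""
    select a, rnk
    from (select a, rank() over (order by c desc) as rnk from t)
    where rnk <= 2
    order by rnk, a
""").fetchall()
print(top_two)
```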
© Andreas Geppert Spring Term 2020 Slide 54
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 55
Reporting
• Pre-defined, created and distributed periodically
• Or defined, created, and consumed on demand
• A report generally consists of data and layout
• Data are typically obtained through database queries
• Layout can be tabular and/or graphical
• Reporting is often distinguished according to purpose or domain
– Management reporting
– Performance reporting
– Financial reporting
– Regulatory reporting
– Technical reporting (e.g. availability reporting)
© Andreas Geppert Spring Term 2020 Slide 56
Reporting: Standard Reports
• Pre-defined
• Created regularly (e.g., month end)
• Distributed to consumers
• Replaces traditional, paper-based reporting
• Possibly with very high requirements and expectations regarding layout and look-and-feel («pixel-perfect reports»)
• Reports are often defined by specialized IT staff
• Reports are typically developed as regular software projects
© Andreas Geppert Spring Term 2020 Slide 57
Reporting: Parameterized Reports
• Very similar to standard reports
• Report and database query contain formal parameters which are instantiated at report generation time
• Actual parameters are then specific to individual report consumers
– Access to such reports often needs to be restricted to a small group of consumers
• Example: net new assets per relationship manager
• (Very basic) drill-down can be implemented by linking parameterized reports
© Andreas Geppert Spring Term 2020 Slide 58
Reporting: Phases
• Report definition
– Tool-based
– Should ideally be possible without deep database or IT skills
– Graphical interface, report editor
– Metadata support is essential
• Report creation
– Data are extracted by executing queries
– Depending on the report definition and database query, the reporting tool can also perform some kind of processing
© Andreas Geppert Spring Term 2020 Slide 59
Reporting: Phases (2)
• Formatting
– Report is formatted according to the layout definition
– Tables and/or charts
• Publishing and distribution
– Report can be stored on the reporting tool’s infrastructure
– Consumers can be notified about availability of the report
– Reports may also be distributed via email directly
© Andreas Geppert Spring Term 2020 Slide 60
Reporting: Ad-hoc Reports
• Not pre-defined, satisfy an urgent, one-off information need
• Reporting phases collapse; in particular, the time gap between report definition and generation does not exist
• In order to provide the required agility, ad-hoc reports cannot be developed with the same rigorous project approach as standard/parameterized reports
• Typically, ad-hoc reports should be definable by end users, or at least by power users
– Implies ease-of-use requirements for the reporting platform and tool
– Layout requirements are typically less strict
© Andreas Geppert Spring Term 2020 Slide 61
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 62
MDX
• Multidimensional Expressions
• Initially proposed by Microsoft
• Query language of SQL Server Analysis Services (previously OLAP Services)
• In the meantime also supported by other multidimensional database systems (e.g. Essbase, Mondrian, Alphablox, ...)
© Andreas Geppert Spring Term 2020 Slide 63
Structure of MDX Statements
• SELECT (axis dimensions)
– columns: set of elements ON COLUMNS
– rows: set of elements ON ROWS
– ... plus ON PAGES, SECTIONS, CHAPTERS, ...
• FROM (cube specification)
– Reference to typically a single cube
– In principle, a multi-dimensional join between cubes is possible as well
• WHERE (“slicer” dimensions)
– Restriction of the data range
• Measures of a cube
– Elements of the mandatory dimension “Measures”
– Standard aggregation operators are defined on schema level (sum, min, max, count)
© Andreas Geppert Spring Term 2020 Slide 64
Sample MDX-Query
select {Produkt.Abteilung.Members} on Columns,
{Standort.Kanton.Members} on Rows
from KioskSales
where (Measures.[Anzahl Verkäufe] )
Lebensmittel Schreibwaren Zeitschriften
Zürich 2 2
Aargau 2 2 5
Uri 2 2
© Andreas Geppert Spring Term 2020 Slide 65
Set Expressions
• Enumeration
– {USA, CA, SF, SJ, Aargau}
• Element expressions
– Schweiz.CHILDREN: returns cantons {ZH, AR, AG, ...}
– ZH.PARENT: returns Switzerland
– DESCENDANTS(Schweiz, Cities): descendants on level Cities
– Time.Quarter.MEMBERS: enumeration of all elements of a dimension hierarchy level
© Andreas Geppert Spring Term 2020 Slide 66
Set Expressions (2)
• Creation of sets
GENERATE({USA, Schweiz},
         DESCENDANTS(Geography.CURRENT, Cities))
– Enumerates all cities in Switzerland and USA
• Nesting sets
CROSSJOIN({USA, Schweiz}, {Mike, John}):
{(USA, Mike), (USA, John), (Schweiz, Mike), (Schweiz, John)}
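The semantics of the two set operations can be mimicked in plain Python. This is only a sketch of the idea, not MDX: the functions crossjoin and generate and the small CITIES hierarchy are hypothetical stand-ins defined here for illustration.

```python
from itertools import product

# Hypothetical geography hierarchy: country -> cities.
CITIES = {"USA": ["SF", "SJ"], "Schweiz": ["Zürich", "Aarau"]}


def crossjoin(set_a, set_b):
    """All tuples combining one member of set_a with one member of set_b."""
    return [(a, b) for a, b in product(set_a, set_b)]


def generate(members, set_function):
    """Apply set_function to each member and union the results in order."""
    result = []
    for member in members:
        for element in set_function(member):
            if element not in result:
                result.append(element)
    return result


pairs = crossjoin(["USA", "Schweiz"], ["Mike", "John"])
all_cities = generate(["USA", "Schweiz"], lambda country: CITIES[country])
print(pairs)
print(all_cities)
```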
© Andreas Geppert Spring Term 2020 Slide 67
Sample MDX-Query with Crossjoin
select Produkt.Abteilung.Members on Columns,
Crossjoin (Standort.Kanton.Members,
{Datum.[2002].Januar,
Datum.[2002].Februar})
on Rows
from KioskSales
where (Measures.[Anzahl Verkäufe] )
Lebensmittel Schreibwaren Zeitschriften
Zürich  Januar   1
        Februar  1
Aargau  Januar   1 1
        Februar  2 2
Uri     Januar   1
        Februar  1 1
© Andreas Geppert Spring Term 2020 Slide 68
Set Expressions (3)
• Relative reference
– Zeit.[1999].LastChild: fourth quarter 1999
– [1999].NextMember: 2000
– [1990]:[2000]: [1990], ..., [2000]
• Level functions
– Schweiz.LEVEL: returns Country
– Zeit.LEVELS(1): returns Year (counting top down)
© Andreas Geppert Spring Term 2020 Slide 69
Special Functions
• TOPCOUNT, TOPPERCENT, TOPSUM
SELECT {[Anzahl Verkäufe]} on COLUMNS,
{TOPCOUNT(Schweiz.CHILDREN, 5,
Sales)}
ON ROWS
FROM KioskSales
WHERE ([Anzahl Verkäufe], [2005])
© Andreas Geppert Spring Term 2020 Slide 70
Special Functions (2)
• FILTER
– In a WHERE clause, only slicers can be specified
– For the specification of predicates, filters have to be used
SELECT FILTER({Schweiz.CHILDREN},
              ([2005], Sales) > 500) ON COLUMNS,
       Quarters.MEMBERS ON ROWS
FROM KioskSales
WHERE ([Anzahl Verkäufe], [2005])
© Andreas Geppert Spring Term 2020 Slide 71
Summary: MDX
• Powerful language for the specification of OLAP queries
– Top-level structure similar to SQL
– Set expressions provide elegant ways to operate on dimension hierarchies
• Functionality
– Many OLAP functions
– Derived measures (WITH clause)
– Presentation aspects (on rows/columns) as part of queries
© Andreas Geppert Spring Term 2020 Slide 72
Outline
1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data
© Andreas Geppert Spring Term 2020 Slide 73
OLAP: "Definition"
• Rules by Codd and others
• 12 rules for the evaluation of OLAP products
• Later extended with 6 further features
• Re-grouped into four groups:
– Basic features
– Special features
– Reporting features
– Dimension control
© Andreas Geppert Spring Term 2020 Slide 74
OLAP: Codd’s Rules
1. Multidimensional conceptual view
2. Transparency
Transparency of the architecture and the database environment (data
origin)
3. Accessibility
Integration of heterogeneous schemas and data
4. Consistent reporting performance
No performance degradation when number of dimensions grows or
database size increases
5. Client/Server architecture
Logically and physically
© Andreas Geppert Spring Term 2020 Slide 75
OLAP: Codd’s Rules (2)
6. Generic dimensionality
7. Dynamic management of sparse cubes
8. Multi-user mode
9. Unrestricted operations across dimensions
10. Intuitive data manipulation
11. Flexible reporting
12. Unrestricted dimensions and aggregation
© Andreas Geppert Spring Term 2020 Slide 76
OLAP: Definition by the OLAP Council
• On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
• OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities including:
– calculations and modeling applied across dimensions, through hierarchies and/or across members
– trend analysis over sequential time periods
– slicing subsets for on-screen viewing
– drill-down to deeper levels of consolidation
– reach-through to underlying detail data
– rotation to new dimensional comparisons in the viewing area
• OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. This is achieved through use of an OLAP Server.
© Andreas Geppert Spring Term 2020 Slide 77
OLAP: FASMI-Test
• Coined by Nigel Pendse
• Fast Analysis of Shared Multi-dimensional Information
• Fast
– Analytic queries must be executed efficiently
– Especially when queries are ad-hoc and interactive
– Balance between pre-computation (database explosion) and on-the-fly computation (performance)
© Andreas Geppert Spring Term 2020 Slide 78
OLAP: FASMI-Test (2)
• Analysis
– Relevant business logic and statistical analysis use cases are supported
– Comprehensible for end users
– Ad-hoc calculations and analysis
• Shared
– Multi-user access
– Security
• Multi-dimensional
– Full support for dimensions, hierarchies, including parallel hierarchies
• Information
– Ability to handle large data volumes
– "Input", not consumed storage!
© Andreas Geppert Spring Term 2020 Slide 79
MOLAP: General Architecture
• Support for analysis of multi-dimensional data
• Multi-dimensional structures as storage objects
[Figure: Client, Server, Data Store]
© Andreas Geppert Spring Term 2020 Slide 80
MOLAP
• Physical storage of cubes: nested arrays
• Arrays contain the finest granularity required for analysis
• Designed for multi-dimensional analysis
• Multidimensional OLAP, MOLAP
• Efficient execution of analytic queries possible (depending on query and design)
• With fine granularity, most of the cube cells are empty (typically > 95%)
– Compression of sparse dimensions and subcubes
– Complex physical design
• Scalability becomes a problem (performance degrades with growing cubes)
• Storage of finest granularity is then no longer possible
– Use coarser granularity
– Analysis of detail data no longer possible
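Why sparsity matters for array storage can be illustrated with a little arithmetic. The dimension sizes and the number of facts below are invented for illustration, and the dictionary is only a conceptual stand-in for the compression schemes of real MOLAP servers.

```python
from math import prod


def cube_density(dimension_sizes, filled_cells):
    """Share of cube cells that actually hold a value."""
    return filled_cells / prod(dimension_sizes)


# Hypothetical cube: 1,000 products x 100 stores x 365 days, 1,000,000 facts.
density = cube_density([1000, 100, 365], 1_000_000)
print(f"{density:.1%} of the cells are filled")

# Sparse alternative to a dense nested array: keep only the non-empty cells,
# keyed by their coordinates.
sparse_cube = {
    ("Cornetto", "Zürich", "2006-06-06"): 15,
    ("Magnum", "Zürich", "2006-06-06"): 25,
}
empty_cell = sparse_cube.get(("Magnum", "Uri", "2006-06-06"))  # not stored
print(empty_cell)
```

A dense array would reserve space for all 36.5 million cells of this cube, although under these assumptions fewer than 5% of them carry data.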
© Andreas Geppert Spring Term 2020 Slide 81
ROLAP: High-level Architecture
[Figure: Client, Server, Data Store]
© Andreas Geppert Spring Term 2020 Slide 82
ROLAP (2)
• Use relational database systems as storage system
• Map multidimensional structures onto relational tables (star schema)
• Implementation of management of and queries against multi-dimensional structures with SQL
• Relational OLAP, ROLAP
© Andreas Geppert Spring Term 2020 Slide 83
ROLAP (3)
• Unrestricted number of dimensions
• Management of very large data volumes possible (many TB)
– Good scalability
• Skills and experience are often available
• Analysis on detail level possible
• Execution of complex analytic queries not always possible
– See section on SQL
• Performance (query response time) often worse than with MOLAP
– Usage of caches to improve performance (reduce I/O)
• Extensions of relational systems
– New operations (some of them already standardized)
– Improved implementations and progress regarding optimizers and access paths (Teradata, DB2, Oracle, SQL Server)
© Andreas Geppert Spring Term 2020 Slide 84
HOLAP: High-level Architecture
[Figure: Client, Server, Data Store]
© Andreas Geppert Spring Term 2020 Slide 85
HOLAP (2)
• Hybrid OLAP
• Tries to combine the advantages of ROLAP and MOLAP
• Multi-dimensional DBS
– Stores aggregates (coarse granularity)
– Ability to analyze detail data
• Drill-through to relational tables
• Cubes and tables can be stored in the same or in different database systems
© Andreas Geppert Spring Term 2020 Slide 87
Data Visualization
Data visualization
• encompasses all sorts of visual representation supporting the exploration, investigation, and communication of data (S. Few)
Visualization of Information
• (vs. scientific visualization): “the use of computer-supported, interactive, visual representations of abstract data to amplify cognition” (Card, Mackinlay, Shneiderman)
© Andreas Geppert Spring Term 2020 Slide 88
Data Visualization
� see Few 2009
[Figure: chart of monthly values, Jan to Dec; y axis from 0 to 4000; series: Domestic (Inland) and Foreign (Ausland)]
© Andreas Geppert Spring Term 2020 Slide 89
Visual Perception
� Text and tables are read and processed sequentially, value by
value
� Graphs can be consumed as a whole
� Characteristics that can be perceived particularly easily:
– Position (2D)
– Length
– Width
– Area
– Form
– Color
– Orientation
� “pre-attentive attributes of visual perception”
© Andreas Geppert Spring Term 2020 Slide 90
Visual Perception: Basic Elements
■ Points
– Two-dimensional position
■ Lines
– Two-dimensional position +
connections
■ Columns
– Height or length
■ Boxes
[Figure: "Points" panel; categories A to E on the x axis, y axis from 0 to 6]
© Andreas Geppert Spring Term 2020 Slide 91
Visual Perception: Basic Elements
[Figure: "Lines" and "Columns" panels; each with categories A to E on the x axis and a y axis from 0 to 6]
© Andreas Geppert Spring Term 2020 Slide 92
Visualization Best Practices
� Graph types
� Dimensionality
� Trellis charts (small multiples)
© Andreas Geppert Spring Term 2020 Slide 93
Charts
� Bar and column charts
– In many cases the best-suited way to
visualize quantitative information
– Comparisons, maximum and minimum are
easy to detect
– In order to enable reasonable comparisons,
the x axis must intersect the y axis at 0
� Line charts
– Well-suited to visualize the trend of
quantitative information over time
– Can be meaningfully combined with column
charts
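Why the axes of a column chart should intersect at 0 can be shown with a short calculation. The values and the truncated baseline below are hypothetical; the sketch compares the ratio of the drawn column heights with the ratio of the underlying values.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn column heights for values a and b
    when the value axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)


values = (95.0, 100.0)  # hypothetical measurements

true_ratio = bar_height_ratio(*values)                       # axis starts at 0
distorted_ratio = bar_height_ratio(*values, baseline=90.0)   # truncated axis

print(f"axis starting at 0:  second column is {true_ratio:.2f}x the first")
print(f"axis starting at 90: second column is {distorted_ratio:.2f}x the first")
```

With the truncated axis, a difference of roughly 5% is drawn as a column twice as high, so visual comparison of the columns no longer reflects the data.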
© Andreas Geppert Spring Term 2020 Slide 94
Charts (2)
� Pie charts
– The chart type that is most often mis-used
– Well suited (if at all) for visualizing proportions
– Comparisons are often difficult, because arcs or areas have to be compared
� Maps
– Visualization of geo-coded data
– Visualization of geographical concentration
� Scatterplot
– Represents relationship (or the lack thereof) between two variables
– Identification of trends, clusters, outliers
© Andreas Geppert Spring Term 2020 Slide 95
Charts (3)
� Bubble Charts
– Not a chart type in its own right
– Represents additional information in other charts such as maps or
scatterplots
� Heatmaps
– Colored visualization of the relationship between two variables
– Third variable can be represented via the size (area) of the rectangles
© Andreas Geppert Spring Term 2020 Slide 96
Dimensionality
� Two-dimensional charts are
preferred
� Three-dimensional charts are
often problematic and hard to
read
© Andreas Geppert Spring Term 2020 Slide 97
Trellis Charts
� A Trellis chart (small
multiples) is a series of
multiple, small, similar
charts
� The series allows one to
visualize an additional
variable or dimension
© Andreas Geppert Spring Term 2020 Slide 99
Dashboards
� Dashboards contain several related indicators or reports
� Usually together with advanced visualization elements
– Maps
– Reports with content-based formatting
– Traffic light visualization
– Speedometer etc.
� "A dashboard is a visual display of the most important information
needed to achieve one or more objectives; consolidated and arranged
on a single screen so the information can be monitored at a glance"
(Stephen Few, Information Dashboard Design)
© Andreas Geppert Spring Term 2020 Slide 100
Dashboard Example: WEBeMars
� Source: web-based Emergency Medicine Analysis & Reporting System (http://www.edims.net/webemars.php)
© Andreas Geppert Spring Term 2020 Slide 101
Dashboard: Sample Metrics
� Sales
– Orders
– Invoices
– Sales pipeline
– Number of orders
– Sales prices
� Marketing
– Market share
– Campaign success
– Customer demographics
� Finances
– Turnover
– Costs
– Profit
� HR
– Employee Satisfaction
– Attrition
– Number of open positions
� Tech Support
– Number of support calls
– Number of closed cases
– Customer satisfaction
– Duration of calls
� Delivery
– Delivery times
– Backlog
– Inventory
� Production
– Number of produced units
– Production times
– Number of defects
� Web Services
– Number of visitors
– Number of page hits
© Andreas Geppert Spring Term 2020 Slide 102
Dashboard: Comparisons of Data
� The same measure at the same point in time in the past
� The same measure at a different point in time in the past
� The current target for the measure
� Relationship to a target in the future
� A past prediction of the measure
� A typical/standard value for the measure
� An extrapolation of the measure into the future
� Another version of the measure
� A different but related measure
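Several of these comparisons come down to the same calculation: the deviation of the current value from some reference value. The following sketch uses hypothetical turnover figures and reference labels; it is one possible way to derive the numbers a dashboard would display, not a prescribed method.

```python
def relative_deviation(actual, reference):
    """Deviation of the actual value from a reference value,
    expressed as a fraction of the reference."""
    return (actual - reference) / reference


# Hypothetical monthly turnover and the reference values it is compared with.
actual = 1100.0
references = {
    "same month last year": 1000.0,
    "current target": 1250.0,
    "past prediction": 1100.0,
}

comparisons = {
    label: relative_deviation(actual, reference)
    for label, reference in references.items()
}

for label, deviation in comparisons.items():
    print(f"vs. {label}: {deviation:+.1%}")
```

Showing the deviation next to the raw measure gives the value the context whose absence is listed among the bad practices.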
© Andreas Geppert Spring Term 2020 Slide 103
Dashboards: Bad Practices
� Distribution of information across multiple screens
� Missing context
� Excessive detail or precision
� Inadequate metrics
� Inadequate representation
� Redundancy
� Inadequate design
� Inadequate coding of quantitative data
� Inadequate emphasis of important information
� Useless decoration
� Inadequate use of colors
© Andreas Geppert Spring Term 2020 Slide 104
Balanced Scorecards
� Concept-oriented business intelligence applications
� Objective: balanced control and steering of the enterprise and its
constituent parts
� A Balanced Scorecard supports measurement, communication and
control of strategic enterprise targets
� Balanced view of four areas
– Finance
– Customers
– Processes (operations)
– Learning and development (people)
© Andreas Geppert Spring Term 2020 Slide 105
Scorecards: Approach
� Definition of strategic goals
� Assignment of at least one key figure to each goal
� Definition of target values for each key figure
� Definition of actions to achieve goals
� Measurement of goal achievement
� Break-down of BSC for subordinate organizational units
� Extension onto functional units like HR, IT
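The first steps of this approach (goals, key figures, target values, measurement of goal achievement) can be sketched as a small data structure. The class names, the example goals, the achievement formula, and the traffic light thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyFigure:
    name: str
    target: float
    actual: float
    higher_is_better: bool = True

    def achievement(self) -> float:
        """Degree of target fulfillment; 1.0 means the target is exactly met."""
        if self.higher_is_better:
            return self.actual / self.target
        return self.target / self.actual


@dataclass
class Goal:
    perspective: str  # e.g. Finance, Customers, Processes, Learning
    name: str
    key_figures: list = field(default_factory=list)

    def achievement(self) -> float:
        """A goal is only as fulfilled as its weakest key figure (assumption)."""
        return min(kf.achievement() for kf in self.key_figures)


def status(achievement, green=1.0, yellow=0.9):
    """Map achievement to a traffic light color; thresholds are assumptions."""
    if achievement >= green:
        return "green"
    if achievement >= yellow:
        return "yellow"
    return "red"


# Hypothetical scorecard with one key figure per goal.
scorecard = [
    Goal("Finance", "Grow turnover",
         [KeyFigure("Turnover", target=200.0, actual=190.0)]),
    Goal("Customers", "Reduce customer attrition",
         [KeyFigure("Attrition rate (%)", target=5.0, actual=4.0,
                    higher_is_better=False)]),
]

report = {goal.name: status(goal.achievement()) for goal in scorecard}
print(report)
```

Breaking the scorecard down to subordinate organizational units would then amount to maintaining one such list of goals per unit.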
© Andreas Geppert Spring Term 2020 Slide 106
Scorecard Tools
� Recommendations of the Balanced Scorecard Initiative
� BSC design
– Implementation of a BSC containing the BSC approach
� Strategy communication
– Documentation and communication of the BSC elements: Goals, target values, key
figures
� Monitoring of implementation
– Monitoring of measures
� Feedback and adaptation
– Reporting of key figures
– Status of target fulfillment using advanced visualization (see Dashboards)
– Possibility to comment
© Andreas Geppert Spring Term 2020 Slide 108
Big Data in the Press
� In 2011, McKinsey estimated that Big Data can contribute …
� … $300 billion potential annual value to US health care
� … €250 billion potential annual value to Europe’s public sector administration
� J. Manyika et al.: Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, May 2011
© Andreas Geppert Spring Term 2020 Slide 109
Big Data Motivation
� processing and analysis of data has traditionally been done in relational databases, using SQL as “inter-galactic data speak”
� in some application areas there is a tremendous increase in data volumes
– RDBMS are not easily able to ingest such data volumes
– Facebook applications create up to several dozen TB of new data each day (!)
– each airplane generates several TB of data on a single flight
� some data “types” are not handled well by RDBMS
– particularly machine-generated data (web logs, sensor data), social media
– those data often are un- or semi-structured or of varying structure
� some analysis and processing styles are not very well supported by RDBMS/SQL
– analysis of semi- and unstructured data
– graph analysis, natural language processing
� data might not be “worth” being stored in a relational database
� The 451 Group: SPRAIN requirements are not very well met by current RDBMS
– SPRAIN: Scalability, Performance, Relaxed consistency, Agility, Intricacy, Necessity
� emergence of Big Data (and NoSQL)
© Andreas Geppert Spring Term 2020 Slide 110
Initial Big Data Technology: Map/Reduce and Hadoop
� Map/Reduce was invented and
first implemented and used at Google
� Hadoop is an Apache project
implementing a runtime environment
for Map/Reduce
� in Hadoop, Map and Reduce functions
can be implemented in Java (other
languages are supported as well)
[Figure: Hadoop stack with the components HDFS, Map/Reduce, Hive, HBase, and Pig]
© Andreas Geppert Spring Term 2020 Slide 111
Map/Reduce and Hadoop
� Map/Reduce is (one of) the most prominent approaches to process Big
Data
� It is a highly scalable approach for processing large data volumes in
parallel
� Map/Reduce can run on clusters consisting of thousands of commodity
servers
� the map phase is executed in parallel and computes sets of key/value
pairs
� the reduce phase (also executed in parallel) combines pairs with equal
keys
© Andreas Geppert Spring Term 2020 Slide 112
Map/Reduce: Example Word Count
input (two text fragments):
1. "In a traditional database, data (the records in a table) are stored row-wise (i.e., the primary key together with all the other attribute values). This is efficient for requests that retrieve one or only a few rows, require all attributes, or are write-intensive. …"
2. "NoSQL database systems are DBMS or other kinds of data management systems that do not offer a SQL interface, or at least not only. In general, NoSQL systems store key/value pairs and the access to data is primarily via the primary key (i.e., no scans, range queries, etc.)."
map (one output list per fragment):
– database/1, column/5, data/5
– database/2, data/6
shuffle: database/(1,2), column/(5), data/(5,6)
reduce (result): column/5, data/11, database/3
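The three phases of the word count example can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code: the function names and the short input fragments are assumptions, and in a real Map/Reduce runtime the map and reduce calls would run in parallel on different nodes.

```python
import re
from collections import defaultdict


def map_phase(text, keywords):
    """Map: emit (word, count) pairs for the keywords found in one fragment."""
    counts = defaultdict(int)
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in keywords:
            counts[word] += 1
    return list(counts.items())


def shuffle(mapped):
    """Shuffle: group the values of all map outputs by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: combine the values of each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input fragments; each one would go to a separate mapper.
fragments = ["Data in a database: data, data!", "A database column holds data."]
keywords = {"database", "column", "data"}

mapped = [map_phase(fragment, keywords) for fragment in fragments]
result = reduce_phase(shuffle(mapped))
print(result)

# The map outputs shown on the slide, pushed through shuffle and reduce.
slide_mapped = [
    [("database", 1), ("column", 5), ("data", 5)],
    [("database", 2), ("data", 6)],
]
slide_groups = shuffle(slide_mapped)
slide_result = reduce_phase(slide_groups)
print(slide_groups)
print(slide_result)
```

Because each fragment is mapped independently and each key is reduced independently, both phases can be distributed across many servers; only the shuffle requires communication between them.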
© Andreas Geppert Spring Term 2020 Slide 113
Big Data Evaluation from a Database Perspective
� Initial Hadoop approaches raise challenges addressed by
(relational) databases long ago for relational query operators
� how to parallelize operations
– communication between phases becomes a critical cost factor
� algorithm design
– application implementors need to design algorithms (or at least evaluate
candidates) from a complexity and cost perspective
� abstraction and declarativeness missing
© Andreas Geppert Spring Term 2020 Slide 114
DWH and Big Data: Delineation
� first generation Big Data/Hadoop vendors aimed at replacing
data warehouses
– not realistic (and no longer claimed)
� Hadoop & Co as massively scalable and parallel ETL engines
Map/Reduce
SQL / BI
© Andreas Geppert Spring Term 2020 Slide 115
DWH and Big Data: Delineation (2)
� DWH and Big Data as complements for different use cases
� each of them is used for use cases it can handle well (see
below)
� An often overlooked aspect is the typically much better data
quality in data warehouses
� Both together form a data lake
[Figure labels: SQL / BI, Big Data]
see M. Selvage: Decision Point for Logical Data Warehouse Implementation Style.
Research ID G00250883, Gartner, May 2013
© Andreas Geppert Spring Term 2020 Slide 116
General Big Data Use Cases
� typical use cases
– customer analysis
– product analysis (sentiment analysis, opinion mining)
– graph analysis
– monitoring, analysis, and planning of operations
– security
– legal and compliance
– stock and fund performance predictions
� data used for such use cases
– social media
– log data (web and other logs)
– sensor data
– RFID events
– location data
– weather data (historical and forecast)