
Spring Term 2020 Slide 1

Data Warehousing

Analytic Applications and

Business Intelligence

Spring Term 2020

Dr. Andreas Geppert

© Andreas Geppert Spring Term 2020 Slide 2

Outline of the Course

• Introduction
• DWH Architecture
• DWH Design and multi-dimensional data models
• Extract, Transform, Load (ETL)
• Metadata
• Data Quality
• Analytic Applications and Business Intelligence
• Implementation and Performance


Outline

1. Analytic Applications

– Classifications and Architecture

– Semantic Models

2. Query Languages: SQL

3. Reporting

4. Query Languages: MDX

5. OLAP

6. Visualization

7. Dashboards and Scorecards

8. Big Data

9. (Data Mining)


Layered Architecture

[Figure: Credit Suisse DWH Reference Architecture V5. Labels shown: Data Sources; Staging Area; Domain Integration and Enrichment; Integration, Aggregation, Calculation; Data Marts; Selection, Aggregation, Calculation; Reporting and Analysis Services; Reporting, OLAP, Data Mining; Front End; GUI; Federated Integration (no ETL); Reference/Master Data; (Meta)data Management. Legend: integration and enrichment logic; extract, transform, load logic; data flow; relational database; multidimensional database; file.]


Analytic Systems

• Concept-oriented Systems
– Balanced Scorecard
– Planning and Budgeting
– Consolidation
– Value-oriented Management
• Generic Systems
– Ad-hoc Analysis Systems: Free OLAP Analysis, Guided OLAP Analysis
– Free Data Retrieval: SQL, MDX
– Reporting Systems: Interactive Reporting Platforms, Generated Reports
– Model-based Analysis Systems: Decision Support Systems, Expert Systems, Data Mining


BI-Tools and (IT-) Skills

[Figure: BI tools by user group and degree of specialization. User groups: IT Developers; Analysts & Information Workers; Executives & Managers; Front-Line Workers; Customers, Suppliers, Regulators. Tools: Production Reporting Tools, Statistics, BI Spreadsheets, OLAP, Business Query, Dashboards, Interactive Fixed Reports, Scorecards, Embedded BI, BI Search, Published Reports.]

Source: Howson 2008


Decisions: Frequency and Economic Impact

[Figure: economic impact of individual decisions (y-axis) against frequency of decision (x-axis)]

• High-impact, infrequent decisions
– Ex: M&A, capital investment, strategic market positioning
• Medium-impact, medium-frequency decisions
– Ex: product development and pricing, customer segmentation
• Low-impact, frequent decisions
– Ex: loan request, cross-sell offers, customer upgrade

Source: Howson 2008


BI System Architecture

[Figure: BI system architecture. Labels shown: tools and spreadsheets; Semantic Layer; BI Server (caching, optimization, security, workflow); DWH, cube, ERP, CRM, OLTP, spreadsheet.]


Semantic Models

• Business users, especially in ad-hoc reporting, cannot be expected to have the skills to access relational data models via SQL
• Semantic models (semantic layers) form an intermediate layer between database/DWH and users:
– Semantic models are closer to business language and terminology than relational models and star schemas
– Semantic models abstract from database structures: joins, aggregations, etc. can be hidden in the mapping of the semantic layer onto database structures
– Ideally, all the required BI tools integrate with the semantic layer
• Examples: Business Objects Universes, OBIEE
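The mapping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary SEMANTIC_MODEL and the function build_query are hypothetical and do not reflect how Business Objects Universes or OBIEE are implemented; the table and column names are borrowed from the star-query example in the SQL part of these slides.

```python
# Hypothetical semantic layer: business terms mapped to SQL fragments.
SEMANTIC_MODEL = {
    "measures": {"Sales": "sum(f.unit_sales)"},
    "attributes": {
        # business term: (column, dimension table, join condition to the fact table)
        "Product": ("p.product_name", "product p", "f.product_id = p.product_id"),
        "Store": ("s.store_name", "store s", "f.store_id = s.store_id"),
    },
}


def build_query(measure, attributes):
    """Translate business terms into a star query; joins stay hidden from the user."""
    columns, tables, joins = [], ["sales_fact f"], []
    for name in attributes:
        column, table, join = SEMANTIC_MODEL["attributes"][name]
        columns.append(column)
        tables.append(table)
        joins.append(join)
    measure_sql = SEMANTIC_MODEL["measures"][measure] + " as " + measure.lower()
    sql = "select " + ", ".join(columns + [measure_sql])
    sql += " from " + ", ".join(tables)
    if joins:
        sql += " where " + " and ".join(joins)
    if columns:
        sql += " group by " + ", ".join(columns)
    return sql


query = build_query("Sales", ["Product", "Store"])
```

The user only names "Sales", "Product", and "Store"; the join conditions and the aggregation come from the mapping.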


Outline

1. Analytic Applications
– Classifications and Architecture
– Semantic Models
2. Query Languages: SQL
– Star Joins
– Super groups
– Aggregate and analysis functions
– Local grouping
3. Reporting
4. Query Languages: MDX
5. OLAP
6. Visualization
7. Dashboards and Scorecards
8. Big Data


Example Schema

[Figure: example schema diagram]


Query Patterns: Star Queries

• Queries against star schemas
• Join the fact table with some or all of the dimension tables
→ Star queries, star joins
• Typically restrictions on dimension tables
• Typically also grouping and aggregation in the resulting table


Example: Star Queries

• Sales per product, store, and day

    select p.product_name, s.store_name, t.the_date,
           sum(f.unit_sales) as sales
    from sales_fact f, store s, product p, time_by_day t
    where f.product_id = p.product_id
      and f.store_id = s.store_id
      and f.time_id = t.time_id
    group by p.product_name, s.store_name, t.the_date;
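To try the star-query pattern end to end, the following sketch builds a tiny star schema in an in-memory SQLite database. The rows are made up for illustration, and the restriction on store_name is an added example of a typical dimension restriction; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table product     (product_id integer primary key, product_name text);
create table store       (store_id   integer primary key, store_name   text);
create table time_by_day (time_id    integer primary key, the_date     text);
create table sales_fact  (product_id integer, store_id integer,
                          time_id integer, unit_sales integer);
insert into product     values (1, 'Cornetto'), (2, 'Magnum');
insert into store       values (1, 'Kiosk A');
insert into time_by_day values (1, '2006-06-06'), (2, '2006-07-07');
insert into sales_fact  values (1, 1, 1, 10), (1, 1, 1, 5), (2, 1, 1, 25), (2, 1, 2, 33);
""")

star_query = """
select p.product_name, s.store_name, t.the_date, sum(f.unit_sales) as sales
from sales_fact f
join product p     on f.product_id = p.product_id
join store s       on f.store_id   = s.store_id
join time_by_day t on f.time_id    = t.time_id
where s.store_name = 'Kiosk A'              -- restriction on a dimension table
group by p.product_name, s.store_name, t.the_date
order by p.product_name, t.the_date
"""
rows = conn.execute(star_query).fetchall()
```

The explicit join syntax is equivalent to the comma-separated from clause with join predicates in the where clause used on the slide.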





Grouping in SQL

• Traditionally:
– group-by clause
– Per query, there is a fixed set of grouping criteria
• Suboptimal for flexible grouping
– Along multiple dimensions
– On multiple levels of a dimension hierarchy
• All combinations of grouping attributes over G1,...,Gn?
– 2^n queries with corresponding grouping criteria
– Example: (product, store, date) yields
  [(product, store, date), (store, date), (product, date), (product, store), (product), (store), (date), ()]
→ Super groups
– Grouping sets, rollup, cube


Grouping Sets

• Grouping along multiple grouping criteria
• In a single query!
→ Grouping sets
– Explicit listing of all grouping criteria
• Example: sum of sales, per
– product,
– day,
– product and day
in one and the same query


Grouping Sets: Sample Data

PRODUCT   SALES_DATE  SALES_CNT
--------- ----------  ---------
Cornetto  06.06.06    15
Magnum    06.06.06    25
(null)    06.06.06    5
Cornetto  07.07.06    22
Magnum    07.07.06    33
(null)    07.07.06    6
Cornetto  (null)      44
Magnum    (null)      9

Grouping Sets: Example

    select product, sales_date, sum(sales_cnt) total
    from kiosk
    group by grouping sets ((product),
                            (sales_date),
                            (product, sales_date));


Grouping Sets: Example (2)

• Possible result:

PRODUCT   SALES_DATE  TOTAL
--------  ----------  -----
Magnum    (null)      9
Cornetto  (null)      44
(null)    06.06.06    5
Magnum    06.06.06    25
Cornetto  06.06.06    15
(null)    07.07.06    6
Magnum    07.07.06    33
Cornetto  07.07.06    22
(null)    (null)      53
(null)    06.06.06    45
(null)    07.07.06    61
(null)    (null)      11
Magnum    (null)      67
Cornetto  (null)      81
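The semantics of grouping sets can be checked by writing the query out as a union of ordinary group-by queries. The sketch below does this in SQLite, chosen only because it ships with Python; SQLite is assumed here not to offer a grouping sets clause, which is why the union form is used. The kiosk rows are the sample data of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table kiosk (product text, sales_date text, sales_cnt integer)")
conn.executemany("insert into kiosk values (?, ?, ?)", [
    ("Cornetto", "06.06.06", 15), ("Magnum", "06.06.06", 25), (None, "06.06.06", 5),
    ("Cornetto", "07.07.06", 22), ("Magnum", "07.07.06", 33), (None, "07.07.06", 6),
    ("Cornetto", None, 44), ("Magnum", None, 9),
])

# One ordinary group-by per grouping set; columns that are not grouped are null.
emulated = """
select product, null as sales_date, sum(sales_cnt) as total from kiosk group by product
union all
select null, sales_date, sum(sales_cnt) from kiosk group by sales_date
union all
select product, sales_date, sum(sales_cnt) from kiosk group by product, sales_date
"""
result = conn.execute(emulated).fetchall()
```

The result contains two rows of the form (null, null, total) with different totals: one stems from a null sales date in the data, the other from a null product in the data. This ambiguity is what the grouping function addresses.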


Grouping Function

• Super-group queries generate null values
– Meaning: "all"
– A set of values cannot be represented in first normal form
• Additionally, there may be null values in the data
– Meaning: not existing, unknown
→ In the query result, the meaning of NULL (-) is not obvious
→ Grouping function
– Shows whether the row is the result of grouping over the column
– 0: value comes from the data (a NULL has been in the data)
– 1: NULL because of grouping ("all")

The Grouping Function: Example

• Similar query as above

    select product, grouping(product) as prodgrp,
           sales_date, grouping(sales_date) dategrp,
           sum(sales_cnt) as total
    from kiosk
    group by grouping sets ((product),
                            (sales_date),
                            (product, sales_date))


The Grouping Function: Example (2)

• Result:

product   prodgrp  sales_date  dategrp  total
--------  -------  ----------  -------  -----
Cornetto  0        06.06.06    0        15
Cornetto  0        07.07.06    0        22
Cornetto  0        (null)      0        44
Cornetto  0        (null)      1        81
Magnum    0        06.06.06    0        25
Magnum    0        07.07.06    0        33
Magnum    0        (null)      0        9
Magnum    0        (null)      1        67
(null)    0        06.06.06    0        5
(null)    1        06.06.06    0        45
(null)    1        07.07.06    0        61
(null)    0        07.07.06    0        6
(null)    1        (null)      0        53
(null)    0        (null)      1        11



• Query as above, with declaration of the «all» values:

    select decode(grouping(product), 1, 'All Products', product),
           decode(grouping(sales_date), 1, 'All Dates', sales_date),
           sum(sales_cnt) as total
    from kiosk
    group by grouping sets ((product),
                            (sales_date),
                            (product, sales_date));



• Result:

product       date       total
------------  ---------  -----
All Products  06.06.06   45
All Products  07.07.06   61
All Products  (null)     53
Cornetto      06.06.06   15
Cornetto      07.07.06   22
Cornetto      (null)     44
Cornetto      All Dates  81
Magnum        06.06.06   25
Magnum        07.07.06   33
Magnum        (null)     9
Magnum        All Dates  67
(null)        06.06.06   5
(null)        07.07.06   6
(null)        All Dates  11
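The effect of the grouping function can be reproduced in a small sketch. Assumptions: SQLite is used only for convenience and is taken not to provide grouping(), so each branch of the union supplies the flag as a literal (1 where the column was grouped away, 0 otherwise); the helper label is a hypothetical stand-in for the decode call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table kiosk (product text, sales_date text, sales_cnt integer)")
conn.executemany("insert into kiosk values (?, ?, ?)", [
    ("Cornetto", "06.06.06", 15), ("Magnum", "06.06.06", 25), (None, "06.06.06", 5),
    ("Cornetto", "07.07.06", 22), ("Magnum", "07.07.06", 33), (None, "07.07.06", 6),
    ("Cornetto", None, 44), ("Magnum", None, 9),
])

# Each branch knows which column it aggregated away and sets the flag accordingly.
flagged = conn.execute("""
select product, 0 as prodgrp, null as sales_date, 1 as dategrp, sum(sales_cnt) as total
from kiosk group by product
union all
select null, 1, sales_date, 0, sum(sales_cnt) from kiosk group by sales_date
union all
select product, 0, sales_date, 0, sum(sales_cnt) from kiosk group by product, sales_date
""").fetchall()


def label(value, grouped, all_label):
    """Replace a grouping NULL by a readable label; keep data values (and data NULLs)."""
    return all_label if grouped == 1 else value


labelled = [(label(p, pg, "All Products"), label(d, dg, "All Dates"), total)
            for p, pg, d, dg, total in flagged]
```

With the flags, the two formerly indistinguishable (null, null) rows become ('All Products', None, 53) and (None, 'All Dates', 11).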


The Cube Operator

• Grouping with all possible combinations?
• G1,...,Gn → 2^n criteria with grouping sets
→ Abbreviation: the cube operator
• cube(G1,...,Gn) ≡ grouping sets(2^{G1,...,Gn}), i.e., all subsets of {G1,...,Gn}
• Example: cube(A, B) ≡ grouping sets((A, B), (A), (B), ())
• (): grand total
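The expansion of cube into grouping sets is simply the power set of the grouping columns. A minimal sketch (the function name cube is chosen here for illustration):

```python
from itertools import combinations


def cube(*columns):
    """All 2^n grouping sets of the given columns, from the finest down to the grand total ()."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]


cube_ab = cube("A", "B")                      # [('A', 'B'), ('A',), ('B',), ()]
cube_3 = cube("product", "store", "date")     # 2^3 = 8 grouping sets
```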


The Cube Operator: Example

• Sales grouped by:
– Product family
– Customer country
– Product family and customer country
– And overall sum (grand total)

    select product_family, country, sum(store_sales)
    from sales_fact, product, product_class pc, customer c
    where ...
    group by cube(pc.product_family, c.country);


The Cube Operator: Example (2)

PRODUCT_FAMILY COUNTRY sales

-------------- ------- ------------

Drink Canada 14256.53

Drink Mexico 57991.18

Drink USA 150349.99

Drink - 222597.70

Food Canada 124649.63

Food Mexico 568431.52

Food USA 1376761.54

Food - 2069842.69

Non-Consumable Canada 30845.95

Non-Consumable Mexico 137558.68

Non-Consumable USA 329025.81

Non-Consumable - 497430.44

- Canada 169752.11

- Mexico 763981.38

- USA 1856137.34

- - 2789870.83


The Rollup Operator

• Often we are not interested in all possible grouping criteria
• But mainly in all aggregates along a (subset of a) dimension hierarchy
• That is, we would like to see a stepwise rollup
→ Rollup operator
• Computes n grouping combinations + grand total
• rollup(Family, Department, Product) ≡
  grouping sets((Family, Department, Product),
                (Family, Department), (Family), ())
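In contrast to cube, rollup only produces the prefixes of the column list. A minimal sketch (the function name rollup is chosen here for illustration):

```python
def rollup(*columns):
    """The n + 1 grouping sets formed by the prefixes of the column list, finest first."""
    return [columns[:size] for size in range(len(columns), -1, -1)]


levels = rollup("Family", "Department", "Product")
```

For n columns this yields n + 1 grouping sets instead of the 2^n sets of the cube operator, which is why the column order matters for rollup but not for cube.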


The Rollup Operator: Example

• Sum of sales per
– Product family and product department
– Product family
– Overall (grand total)

    select product_family, product_department, sum(store_sales)
    from sales_fact f, product p, product_class pc
    where ...
    group by rollup(pc.product_family,
                    pc.product_department)


The Rollup Operator: Example (2)

PRODUCT_FAMILY PRODUCT_DEPARTMENT SALES

-------------- ------------------ ----------

Drink Alcoholic Beverages 68118.28

Drink Beverages 123181.16

Drink Dairy 31298.26

Drink - 222597.70

Food Baked Goods 107589.71

Food Baking Goods 185836.08

Food Breakfast Foods 36523.92

Food Canned Foods 183554.39

Food Dairy 152413.99

Food Eggs 43091.95

Food Frozen Foods 287099.75

... ... ...

Food - 2069842.69

Non-Consumable Health and Hygiene 144139.47

Non-Consumable Household 283380.38

Non-Consumable Periodicals 40860.55

Non-Consumable - 497430.44

- - 2789870.83





New Aggregation Functions in SQL

• Traditional aggregate functions
– sum, count, min, max, avg
– Operate on entire columns or on partitions resulting from a traditional group-by clause ("global grouping")
• This is often not sufficient for analytic queries
– For instance, aggregates (or, more generally, calculated values) should be computed based on a subset of the result set relative to single tuples (rows)
• "Local" grouping
– Other terms: window functions, OLAP functions, new aggregate functions
• Examples:
– Ranking: top seller
– Numbering of rows
– Position of elements in a list


Rank Functions

• Rank functions
– Assign a numeric value to each individual row
– Rank = position of the element in a sorted list
– Note that the position of a tuple in a list obtained by an order-by clause is only implicit!
• Rank operator
– Sort criterion is specified in a (new) order clause
– If two or more elements tie, the following ranks are not assigned
• denserank (dense_rank in standard SQL): no gaps in ranks
• Numbering function:
– Enumerates tuples
– Never assigns equal numbers to tuples
– row_number


Rank Functions: Examples

A   B   C
a1  b1  9
a2  b2  8
a3  b1  8
a4  b2  6
a5  b1  5
a6  b2  4
a7  b1  3

Result columns (values in slide order):
rank() over(order by c):        3, 1, 2, 4, 5, 5, 7
denserank() over(order by c):   3, 1, 2, 4, 5, 5, 6
row_number() over(order by c):  3, 1, 2, 4, 5, 6, 7
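The difference between the three functions can be reproduced with SQLite (version 3.25 or later is assumed, since earlier versions lack window functions). The sketch loads the A, B, C rows listed above and ranks them by c in descending order; the descending order and the tie-breaker on a for row_number are choices made here to get a deterministic result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c integer)")
conn.executemany("insert into t values (?, ?, ?)", [
    ("a1", "b1", 9), ("a2", "b2", 8), ("a3", "b1", 8), ("a4", "b2", 6),
    ("a5", "b1", 5), ("a6", "b2", 4), ("a7", "b1", 3),
])

rows = conn.execute("""
select a, c,
       rank()       over (order by c desc)    as rnk,        -- gap after a tie
       dense_rank() over (order by c desc)    as dense_rnk,  -- no gaps
       row_number() over (order by c desc, a) as row_num     -- never equal
from t
order by c desc, a
""").fetchall()

ranks = [r[2] for r in rows]
dense_ranks = [r[3] for r in rows]
row_numbers = [r[4] for r in rows]
```

The tie on c = 8 (rows a2 and a3) gives both rows rank 2; rank then continues with 4, dense_rank with 3, and row_number numbers the two tied rows 2 and 3.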


Rank Functions: Examples

• The list of products, ordered by total sales, including position in the sorted list

    select p.product_name,
           f.storeSales,
           rank() over(order by f.storeSales desc)
               as salesRank
    from productSalesV f join
         product p on f.product_id = p.product_id
    order by salesRank asc;


Rank Functions: Examples (2)

PRODUCT_NAME STORESALES SALESRANK

------------------------------- --------------- ---------

Carrington Turkey TV Dinner 11753.84 1

Big Time Apple Cinnamon Waffles 11585.34 2

CDR Vegetable Oil 11493.60 3

High Quality 60 Watt Lightbulb 11408.76 4

Ebony Lettuce 11371.86 5

...

Super Columbian Coffee 245.00 1558

Top Measure Chardonnay Wine 240.12 1559


Rank Functions: Examples (3)

• The ten best-selling products

    select *
    from table(select p.product_name,
                      f.storeSales,
                      rank() over(order by f.storeSales desc)
                          as salesRank
               from productSalesV ...) sr
    where salesRank <= 10
    order by salesRank;





Local Grouping

• Calculate aggregates based on partitions given by the context of single tuples
→ «Local» grouping
– Each individual tuple defines an aggregate (possibly together with other tuples)
• Comparisons to other aggregates are possible
– For instance, change in monthly sales compared to the previous year
• Moving average
– For instance, product sales averaged over previous, current, and following month
• Cumulated sums
– For instance, monthly sums added up to current month (year-to-date)


Local Grouping (2)

• Hard to formulate (if possible at all) with traditional SQL
– Possibly multiple SQL statements are required
→ Local grouping
– One output tuple per input tuple
– Local partitioning criterion
– Local ordering criterion
– Window size


Local Partitioning

• Partition is defined based on the «current» tuple
• Aggregates will be calculated over this partition
• partition-by clause
• Typically used for pre-aggregated data
• Often used to compute ratios (ratio-to-report)


Local Partitioning (2)

A   B   C   sum(c) over(partition by b)   sum(c) over(partition by b order by a)
a1  b1  6   15                            6
a2  b2  5   11                            5
a3  b1  5   15                            11
a4  b2  4   11                            9
a5  b1  3   15                            14
a6  b2  2   11                            11
a7  b1  1   15                            15
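The two aggregates can be recomputed with SQLite (3.25 or later assumed for window functions). The table name t is chosen here for illustration; the rows are the A, B, C values of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c integer)")
conn.executemany("insert into t values (?, ?, ?)", [
    ("a1", "b1", 6), ("a2", "b2", 5), ("a3", "b1", 5), ("a4", "b2", 4),
    ("a5", "b1", 3), ("a6", "b2", 2), ("a7", "b1", 1),
])

rows = conn.execute("""
select a, b, c,
       sum(c) over (partition by b)            as part_sum,    -- whole partition
       sum(c) over (partition by b order by a) as running_sum  -- partition up to current row
from t
order by a
""").fetchall()

part_sums = [r[3] for r in rows]
running_sums = [r[4] for r in rows]
```

Without an order-by in the over clause, every row sees the sum of its whole partition; adding order by a turns the same aggregate into a running sum within the partition.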


Local Partitioning: Example

• Sum of sales per day plus ratio to monthly (percent) and yearly (per mille) sales

    select theDate, storeSales,
           sum(storeSales) over(partition by month(theDate))
               as mSales,
           100 * storeSales /
               sum(storeSales) over(partition by month(theDate))
               as monPcnt,
           sum(storeSales) over(partition by year(theDate))
               as ySales,
           1000 * storeSales /
               sum(storeSales) over(partition by year(theDate))
               as yearPmil
    from timeSalesV;


Local Partitioning: Example (2)

thedate storesales msales monpcnt ysales yearpmil

---------- ---------- ---------- ------- ----------- -----

01/01/2003 1139.06 228289.43 0.00 890572.78 1.00

...

01/05/2003 2877.56 228289.43 1.00 890572.78 3.00

...

02/15/2003 4939.80 217254.31 2.00 890572.78 5.00

02/16/2003 3195.47 217254.31 1.00 890572.78 3.00

...

01/02/2004 6379.30 228289.43 2.00 1899298.05 3.00

...

02/01/2004 6706.19 217254.31 3.00 1899298.05 3.00

...


Local Grouping: Calculation of Ratios

• Calculation of ratios using ratio_to_report
• Determines the relative share that a value contributes to a sum

    select …
           ratio_to_report(sales)
               over (partition by month(theDate)) * 100
               as monPcnt
    …


Windows

• Windows can be moved over an (intermediate) table
• Aggregates will be computed over the data «visible through the window»
• Window (size) defined in terms of the current tuple
→ Window clause
• Options:
– Position based:
  n (or all, none) tuples before T (in the specified sort order)
  n (or all, none) tuples after T (in the specified sort order)
– Value based


Windows (2)

A   B   C
a1  b1  6
a2  b2  5
a3  b1  5
a4  b2  4
a5  b1  3
a6  b2  2
a7  b1  1

Window specification: partition by b order by a rows between 1 preceding and 1 following
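A sketch of this position-based window in SQLite (3.25 or later assumed). It uses sum rather than avg so that the window contents are easy to verify by hand; the table name t is chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c integer)")
conn.executemany("insert into t values (?, ?, ?)", [
    ("a1", "b1", 6), ("a2", "b2", 5), ("a3", "b1", 5), ("a4", "b2", 4),
    ("a5", "b1", 3), ("a6", "b2", 2), ("a7", "b1", 1),
])

rows = conn.execute("""
select a, b, c,
       sum(c) over (partition by b
                    order by a
                    rows between 1 preceding and 1 following) as window_sum
from t
order by a
""").fetchall()

window_sums = {r[0]: r[3] for r in rows}
```

For a3 (partition b1, ordered a1, a3, a5, a7) the window covers a1, a3, and a5, so the sum is 6 + 5 + 3 = 14. At the partition borders the window is simply smaller: a1 only sees itself and a3.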


Moving Average: Example

• Three-month average of the sum of sales

    select monat, jahr, storeSales,
           avg(storeSales)
               over(partition by jahr
                    order by monat
                    rows between 1 preceding and 1 following)
               as avg_3_mon
    from monthSalesV


Moving Average: Example (2)

MONAT JAHR STORESALES AVG_3_MON

----------- ----------- ---------- -----------

1 2003 70923.85 71068.94

2 2003 71214.04 74140.24

3 2003 80282.83 72513.60

4 2003 66043.94 72294.36

...

9 2003 69036.89 68672.30

10 2003 64944.22 72704.42

11 2003 84132.16 79524.97

12 2003 89498.53 86815.34

1 2004 157365.58 151702.92

2 2004 146040.27 153100.34

3 2004 155895.17 150537.91

...


Cumulated Sums

• Individual tuples contribute successively to the computation of a sum
• Sum in step 1 (per partition!) = attribute value of the first tuple
• Sum in step n+1 (per partition!) = sum of step n + attribute value of the (n+1)st tuple
• Result tuples represent cumulated sums


Cumulated Sums: Example

• Monthly sales and sum of sales up to and including the current month

    select monat, jahr, storeSales,
           sum(storeSales) over(partition by jahr
                                order by monat
                                rows unbounded preceding) cum_Sales
    from monthSalesV;


Cumulated Sums: Example (2)

MONAT JAHR STORESALES CUM_SALES

----------- ----------- ---------- -------------

1 2003 70923.85 70923.85

2 2003 71214.04 142137.89

3 2003 80282.83 222420.72

4 2003 66043.94 288464.66

5 2003 70556.32 359020.98

...

11 2003 84132.16 801074.25

12 2003 89498.53 890572.78

1 2004 157365.58 157365.58

2 2004 146040.27 303405.85

3 2004 155895.17 459301.02

4 2004 149678.31 608979.33

...
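The cumulated sums can be reproduced on a few of the rows shown above. Assumptions in this sketch: SQLite 3.25 or later, monthSalesV created as a plain table rather than a view, and the frame written in its long form (rows between unbounded preceding and current row), which is taken to be equivalent to the shorthand used in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table monthSalesV (monat integer, jahr integer, storeSales real)")
conn.executemany("insert into monthSalesV values (?, ?, ?)", [
    (1, 2003, 70923.85), (2, 2003, 71214.04), (3, 2003, 80282.83),
    (1, 2004, 157365.58), (2, 2004, 146040.27),
])

rows = conn.execute("""
select monat, jahr, storeSales,
       sum(storeSales) over (partition by jahr
                             order by monat
                             rows between unbounded preceding and current row) as cum_Sales
from monthSalesV
order by jahr, monat
""").fetchall()

# Round in Python to hide floating-point noise in the running sums.
cum_sales = [round(r[3], 2) for r in rows]
```

Because of partition by jahr, the running sum restarts with the first month of 2004.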


Processing of Analytic Queries

1. Data filtering (WHERE)
2. (Global) grouping (GROUP BY)
3. Filtering of aggregates (HAVING)
4. Computation of analytic functions
– Each analytic function is computed on its own
– Creation of partitions
– Sorting of partitions
– Application of ranking or aggregate functions
5. Sorting of the final result (ORDER BY)
• Note that WHERE and HAVING are applied before the computation of analytic functions
→ Results of analytic functions cannot be referred to in these clauses
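Both consequences of this processing order can be observed in a small sketch (SQLite 3.25 or later assumed; productSalesV is created here as a plain table with made-up rows). First, a where clause changes the ranks, because ranking only sees the rows that survive the filter. Second, filtering on a rank requires computing the rank in a subquery and filtering in the outer query, as in the top-ten example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table productSalesV (product_id integer, storeSales real)")
conn.executemany("insert into productSalesV values (?, ?)",
                 [(1, 300.0), (2, 100.0), (3, 200.0), (4, 50.0)])

# WHERE runs first: ranks are computed over the rows that survive the filter.
ranked_after_filter = conn.execute("""
    select product_id, rank() over (order by storeSales desc) as salesRank
    from productSalesV
    where storeSales < 250
    order by salesRank
""").fetchall()

# To filter on a rank, compute it in a subquery and filter in the outer query.
top_two = conn.execute("""
    select product_id, salesRank
    from (select product_id,
                 rank() over (order by storeSales desc) as salesRank
          from productSalesV) sr
    where salesRank <= 2
    order by salesRank
""").fetchall()
```

In the first query, product 3 receives rank 1 although product 1 has higher sales overall, because product 1 was filtered out before ranking.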




Reporting

• Pre-defined, created and distributed periodically
• Or defined, created, and consumed on demand
• A report generally consists of data and layout
• Data are typically obtained through database queries
• Layout can be tabular and/or graphical
• Reporting is often distinguished according to purpose or domain
– Management reporting
– Performance reporting
– Financial reporting
– Regulatory reporting
– Technical reporting (e.g. availability reporting)


Reporting: Standard Reports

• Pre-defined
• Created regularly (e.g., month end)
• Distributed to consumers
• Replaces traditional, paper-based reporting
• Possibly with very high requirements and expectations regarding layout and look-and-feel («pixel-perfect reports»)
• Reports are often defined by specialized IT staff
• Reports are typically developed as regular software projects


Reporting: Parameterized Reports

• Very similar to standard reports
• Report and database query contain formal parameters which are instantiated at report generation time
• Actual parameters are then specific to individual report consumers
– Access to such reports often needs to be restricted to a small group of consumers
• Example: net new assets per relationship manager
• (Very basic) drill-down can be implemented by linking parameterized reports


Reporting: Phases

• Report definition
– Tool-based
– Should ideally be possible without deep database or IT skills
– Graphical interface, report editor
– Metadata support is essential
• Report creation
– Data are extracted by executing queries
– Depending on the report definition and database query, the reporting tool can also perform some kind of processing


Reporting: Phases (2)

• Formatting
– Report is formatted according to the layout definition
– Tables and/or charts
• Publishing and distribution
– Report can be stored on the reporting tool’s infrastructure
– Consumers can be notified about availability of the report
– Reports may also be distributed via email directly


Reporting: Ad-hoc Reports

• Not pre-defined, satisfy an urgent, one-off information need
• Reporting phases collapse; in particular, the time gap between report definition and generation does not exist
• In order to provide the required agility, ad-hoc reports cannot be developed with the same rigorous project approach as standard/parameterized reports
• Typically ad-hoc reports should be definable by end users, at least power users
– Implies ease-of-use requirements for the reporting platform and tool
– Layout requirements are typically less strict




MDX

• Multidimensional Expressions
• Initially proposed by Microsoft
• Query language of SQL Server Analysis Services (previously OLAP Services)
• In the meantime also supported by other multidimensional database systems (e.g. Essbase, Mondrian, Alphablox, ...)


Structure of MDX Statements

• SELECT (axis dimensions)
– Columns: set of elements ON COLUMNS
– Rows: set of elements ON ROWS
– ... plus ON PAGES, SECTIONS, CHAPTERS, ...
• FROM (cube specification)
– Reference to typically a single cube
– In principle a multi-dimensional join between cubes is possible as well
• WHERE ("slicer" dimensions)
– Restriction of the data range
• Measures of a cube
– Elements of the mandatory dimension "Measures"
– Standard aggregation operators are defined on schema level (sum, min, max, count)


Sample MDX-Query

    select {Produkt.Abteilung.Members} on Columns,
           {Standort.Kanton.Members} on Rows
    from KioskSales
    where (Measures.[Anzahl Verkäufe])

Lebensmittel Schreibwaren Zeitschriften

Zürich 2 2

Aargau 2 2 5

Uri 2 2


Set Expressions

• Enumeration
– {USA, CA, SF, SJ, Aargau}
• Element expressions
– Schweiz.CHILDREN: returns cantons {ZH, AR, AG, ...}
– ZH.PARENT: returns Switzerland
– DESCENDANTS(Schweiz, Cities): descendants on level Cities
– Time.Quarter.MEMBERS: enumeration of all elements of a dimension hierarchy level


Set Expressions (2)

• Creation of sets

    GENERATE({USA, Schweiz},
             DESCENDANTS(Geography.CURRENT, Cities))

– Enumerates all cities in Switzerland and USA
• Nesting sets

    CROSSJOIN({USA, Schweiz}, {Mike, John}):
    {(USA, Mike), (USA, John), (Schweiz, Mike), (Schweiz, John)}
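The set operations can be mimicked in plain Python to see what they return. This is an analogy, not MDX: the hierarchy in CHILDREN is a made-up two-level geography (country, city), and the functions children, generate, and crossjoin are hypothetical stand-ins for the corresponding MDX constructs.

```python
from itertools import product

# Hypothetical two-level geography hierarchy: country -> cities.
CHILDREN = {
    "USA": ["SF", "SJ"],
    "Schweiz": ["Zürich", "Aarau"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def children(member):
    """Members one level below the given member (empty for leaf members)."""
    return CHILDREN.get(member, [])


def generate(members, set_function):
    """Apply set_function to each member and concatenate the results, dropping duplicates."""
    result = []
    for member in members:
        for item in set_function(member):
            if item not in result:
                result.append(item)
    return result


def crossjoin(set_a, set_b):
    """All combinations (tuples) of one element from each set, first set varying slowest."""
    return list(product(set_a, set_b))


cities = generate(["USA", "Schweiz"], children)
pairs = crossjoin(["USA", "Schweiz"], ["Mike", "John"])
```

In this two-level hierarchy the children of a country coincide with its descendants on the city level, which is why children can stand in for DESCENDANTS here.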


Sample MDX-Query with Crossjoin

    select Produkt.Abteilung.Members on Columns,
           Crossjoin(Standort.Kanton.Members,
                     {Datum.[2002].Januar,
                      Datum.[2002].Februar})
           on Rows
    from KioskSales
    where (Measures.[Anzahl Verkäufe])

Lebensmittel Schreibwaren Zeitschriften

Zürich  Januar   1
        Februar  1
Aargau  Januar   1 1
        Februar  2 2
Uri     Januar   1
        Februar  1 1


Set Expressions (3)

• Relative reference
– Zeit.[1999].LastChild: fourth quarter of 1999
– [1999].NextMember: 2000
– [1990]:[2000]: [1990], ..., [2000]
• Level functions
– Schweiz.LEVEL: returns Country
– Zeit.LEVELS(1): returns Year (counting top down)


Special Functions

• TOPCOUNT, TOPPERCENT, TOPSUM

    SELECT {[Anzahl Verkäufe]} ON COLUMNS,
           {TOPCOUNT(Schweiz.CHILDREN, 5, Sales)} ON ROWS
    FROM KioskSales
    WHERE ([Anzahl Verkäufe], [2005])


Special Functions (2)

• FILTER
– In a WHERE clause, only slicers can be specified
– For the specification of predicates, filters have to be used

    SELECT FILTER({Schweiz.CHILDREN},
                  ([2005], Sales) > 500) ON COLUMNS,
           Quarters.MEMBERS ON ROWS
    FROM KioskSales
    WHERE ([Anzahl Verkäufe], [2005])
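What TOPCOUNT and FILTER select can be illustrated with a Python analogy. The sales figures per canton are made up, and topcount and filter_set are hypothetical helper names; the sketch only mirrors the selection logic, not MDX evaluation.

```python
# Hypothetical sales per canton for one year.
sales = {"ZH": 900, "AG": 450, "UR": 120, "BE": 700, "GE": 520, "TI": 300}


def topcount(members, count, measure):
    """The `count` members with the highest measure value, in descending order."""
    return sorted(members, key=lambda m: measure[m], reverse=True)[:count]


def filter_set(members, predicate):
    """Members for which the predicate holds, in their original order."""
    return [m for m in members if predicate(m)]


top3 = topcount(list(sales), 3, sales)
above_500 = filter_set(list(sales), lambda m: sales[m] > 500)
```

TOPCOUNT reorders the set by the measure, whereas FILTER keeps the original order of the set and only drops members that fail the predicate.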


Summary: MDX

• Powerful language for the specification of OLAP queries
– Top-level structure similar to SQL
– Set expressions provide elegant ways to operate on dimension hierarchies
• Functionality
– Many OLAP functions
– Derived measures (WITH clause)
– Presentation aspects (on rows/columns) as part of queries




OLAP: "Definition"

• Rules by Codd and others
• 12 rules for the evaluation of OLAP products
• Later extended with 6 further features
• Re-grouped into four groups:
– Basic features
– Special features
– Reporting features
– Dimension control


OLAP: Codd’s Rules

1. Multidimensional conceptual view

2. Transparency

Transparency of the architecture and the database environment (data

origin)

3. Accessibility

Integration of heterogeneous schemas and data

4. Consistent reporting performance

No performance degradation when number of dimensions grows or

database size increases

5. Client/Server architecture

Logically and physically


OLAP: Codd’s Rules (2)

6. Generic dimensionality

7. Dynamic management of sparse cubes

8. Multi-user mode

9. Unrestricted operations across dimensions

10. Intuitive data manipulation

11. Flexible reporting

12. Unrestricted dimensions and aggregation


OLAP: Definition by the OLAP Council

• On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
• OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities including:
– calculations and modeling applied across dimensions, through hierarchies and/or across members
– trend analysis over sequential time periods
– slicing subsets for on-screen viewing
– drill-down to deeper levels of consolidation
– reach-through to underlying detail data
– rotation to new dimensional comparisons in the viewing area
• OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. This is achieved through use of an OLAP Server.


OLAP: FASMI-Test

• Coined by Nigel Pendse
• Fast Analysis of Shared Multi-dimensional Information
• Fast
– Analytic queries must be executed efficiently
– Especially when queries are ad-hoc and interactive
– Balance between:
  pre-computation (→ database explosion) and
  on-the-fly computation (→ performance)


OLAP: FASMI-Test (2)

• Analysis
– Relevant business logic and statistical analysis use cases are supported
– Comprehensible for end users
– Ad-hoc calculations and analysis
• Shared
– Multi-user access
– Security
• Multi-dimensional
– Full support for dimensions, hierarchies, including parallel hierarchies
• Information
– Ability to handle large data volumes
– "Input", not consumed storage!


MOLAP: General Architecture

• Support for analysis of multi-dimensional data
• Multi-dimensional structures as storage objects

[Figure: client, server, and data store]


MOLAP

• Physical storage of cubes: nested arrays
• Arrays contain the finest granularity required for analysis
• Designed for multi-dimensional analysis
→ Multidimensional OLAP, MOLAP
• Efficient execution of analytic queries possible (depending on query and design)
• With fine granularity, most of the cube cells are empty (typically > 95%)
– Compression of sparse dimensions and subcubes
– Complex physical design
• Scalability becomes a problem (performance degrades with growing cubes)
• Storage of finest granularity is then no longer possible
– Use coarser granularity
– Analysis of detail data no longer possible
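The sparsity problem can be made concrete with a small calculation. The sketch below is only an illustration with made-up dimension sizes and facts; it stores non-empty cells in a dictionary, which is one simple way to avoid materializing empty cells and is not meant to describe the compression scheme of any particular MOLAP server.

```python
# Hypothetical cube with three dimensions; only cells with facts are stored.
products = ["p%d" % i for i in range(50)]
stores = ["s%d" % i for i in range(20)]
days = ["d%d" % i for i in range(100)]

facts = {
    ("p0", "s0", "d0"): 15,
    ("p1", "s0", "d0"): 25,
    ("p1", "s3", "d7"): 33,
}

dense_cells = len(products) * len(stores) * len(days)   # cells of a full nested array
fill_ratio = len(facts) / dense_cells                    # share of non-empty cells


def cell(product, store, day):
    """Look up a cell; empty cells are not stored and read as None."""
    return facts.get((product, store, day))
```

A full array for this small cube already has 100,000 cells, and each additional dimension multiplies that number, while the number of facts grows much more slowly.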


ROLAP: High-level Architecture

[Figure: client, server, and data store]


ROLAP (2)

• Use relational database systems as storage system
• Map multidimensional structures onto relational tables (star schema)
• Implementation of management of and queries against multi-dimensional structures with SQL
→ Relational OLAP, ROLAP


ROLAP (3)

• Unrestricted number of dimensions
• Management of very large data volumes possible (many TB)
– Good scalability
• Skills and experience are often available
• Analysis of detail level possible
• Execution of complex analytic queries not always possible
– See section on SQL
• Performance (query response time) often worse than with MOLAP
– Usage of caches to improve performance (reduce I/O)
• Extensions of relational systems
– New operations (some of them already standardized)
– Improved implementations and progress regarding optimizers and access paths (Teradata, DB2, Oracle, SQL Server)


HOLAP: High-level Architecture

[Figure: client, server, and data store]


HOLAP (2)

• Hybrid OLAP
• Tries to combine the advantages of ROLAP and MOLAP
• Multi-dimensional DBS
– Stores aggregates (coarse granularity)
– Ability to analyze detail data
• Drill-through to relational tables
• Cubes and tables can be stored in the same or in different database systems




Data Visualization

Data visualization

• encompasses all sorts of visual representation supporting the exploration, investigation, and communication of data (S. Few)

Visualization of Information

• (vs. scientific visualization): "the use of computer-supported, interactive, visual representations of abstract data to amplify cognition" (Card, Mackinlay, Shneiderman)


Data Visualization

• See Few 2009

[Figure: chart of monthly values (Jan to Dec), value axis from 0 to 4000, series Inland (domestic) and Ausland (abroad)]


Visual Perception

• Text and tables are read and processed sequentially, value by value
• Graphs can be consumed as a whole
• Characteristics that can be perceived particularly easily:
– Position (2D)
– Length
– Width
– Area
– Form
– Color
– Orientation
• "Pre-attentive attributes of visual perception"


Visual Perception: Basic Elements

■ Points
– Two-dimensional position
■ Lines
– Two-dimensional position + connections
■ Columns
– Height or length
■ Boxes

[Figure: point chart over categories A to E with a value axis from 0 to 6]


Visual Perception: Basic Elements (2)

[Figure: line chart and column chart, each over categories A to E with a value axis from 0 to 6]


Visualization Best Practices

• Graph types
• Dimensionality
• Trellis charts (small multiples)


Charts

• Bar and column charts
– In many cases the best-suited way to visualize quantitative information
– Comparisons, maximum and minimum are easy to detect
– In order to enable reasonable comparisons, the x axis must intersect the y axis at 0
• Line charts
– Well-suited to visualize the trend of quantitative information over time
– Can be meaningfully combined with column charts


Charts (2)

• Pie charts
– The chart type that is most often misused
– Well suited (if at all) for visualizing proportions
– Comparisons are often difficult, because arcs or areas have to be compared
• Maps
– Visualization of geo-coded data
– Visualization of geographical concentration
• Scatterplot
– Represents relationship (or the lack thereof) between two variables
– Identification of trends, clusters, outliers


Charts (3)

• Bubble charts
– Not a chart type in its own right
– Represents additional information in other charts such as maps or scatterplots
• Heatmaps
– Colored visualization of the relationship between two variables
– Third variable can be represented via the size (area) of the rectangles


Dimensionality

• Two-dimensional charts are preferred
• Three-dimensional charts are often problematic and hard to read


Trellis Charts

• A Trellis chart (small multiples) is a series of multiple, small, similar charts
• The series allows one to visualize an additional variable or dimension




Dashboards

• Dashboards contain several related indicators or reports
• Usually together with advanced visualization elements
– Maps
– Reports with content-based formatting
– Traffic-light visualization
– Speedometer etc.
• "A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance" (Stephen Few, Information Dashboard Design)


Dashboard Example: WEBeMars

• Source: web-based Emergency Medicine Analysis & Reporting System (http://www.edims.net/webemars.php)


Dashboard: Sample Metrics

� Sales

– Orders

– Invoices

– Sales pipeline

– Number of orders

– Sales prices

� Marketing

– Market share

– Campaing success

– Customer demographics

� Finances

– Turnover

– Costs

– Profit

� HR

– Employee Satisfaction

– Attrition

– Number of open positions

� Tech Support

– Number of support calls

– Number of closed cases

– Customer satisfaction

– Duration of calls

� Delivery

– Delivery times

– Backlog

– Inventory

� Production

– Number of produced units

– Production times

– Number of defects

� Web Services

– Number of visitors

– Number of page hits


Dashboard: Comparisons of Data

� The same measure at the same point in time in the past

� The same measure at a different point in time in the past

� The current target for the measure

� Relationship to a target in the future

� A past prediction of the measure

� A typical/standard value for the measure

� An extrapolation of the measure into the future

� Another version of the measure

� A different but related measure
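
A minimal sketch of two of the comparisons listed above (the same measure in the past, and the current target). The class `MeasureContext`, the function names, and the sample turnover figures are hypothetical and only illustrate how such context values might be computed for a dashboard.

```python
from dataclasses import dataclass


@dataclass
class MeasureContext:
    """Hypothetical container for one measure and its comparison values."""
    current: float
    same_period_last_year: float
    target: float


def relative_change(value: float, reference: float) -> float:
    """Relative deviation of value from reference (0.10 means +10%)."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (value - reference) / reference


def compare(measure: MeasureContext) -> dict[str, float]:
    """Put the current value into context: past value and current target."""
    return {
        "vs_last_year": relative_change(measure.current, measure.same_period_last_year),
        "vs_target": relative_change(measure.current, measure.target),
    }


# Invented sample data: turnover of 110 against 100 last year and a target of 125.
turnover = MeasureContext(current=110.0, same_period_last_year=100.0, target=125.0)
result = compare(turnover)
print(result)
```

A dashboard would typically show such deviations next to the measure itself, so that the number is not displayed without context.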


Dashboards: Bad Practices

� Distribution of information across multiple screens

� Missing context

� Excessive detail or precision

� Inadequate metrics

� Inadequate representation

� Redundancy

� Inadequate design

� Inadequate coding of quantitative data

� Inadequate emphasis of important information

� Useless decoration

� Inadequate use of colors


Balanced Scorecards

� Concept-oriented business intelligence applications

� Objective: balanced control and steering of the enterprise and its constituent parts

� A Balanced Scorecard supports measurement, communication and control of strategic enterprise targets

� Balanced view of four areas

– Finance

– Customers

– Processes (operations)

– Learning and development (people)


Scorecards: Approach

� Definition of strategic goals

� Assignment of at least one key figure to each goal

� Definition of target values for each key figure

� Definition of actions to achieve goals

� Measurement of goal achievement

� Break-down of BSC for subordinate organizational units

� Extension onto functional units like HR, IT
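
The first steps of the approach (goals, key figures, target values, measurement of goal achievement) can be sketched as a small data structure. This is only an illustration: the classes `KeyFigure` and `Goal`, the achievement formula, and all sample values are assumptions, not part of any BSC standard or tool.

```python
from dataclasses import dataclass


@dataclass
class KeyFigure:
    """Hypothetical key figure with a target value (values assumed positive)."""
    name: str
    target: float
    actual: float
    higher_is_better: bool = True

    def achievement(self) -> float:
        """Degree of target achievement; 1.0 means the target is exactly met."""
        if self.higher_is_better:
            return self.actual / self.target
        return self.target / self.actual


@dataclass
class Goal:
    """A strategic goal with at least one key figure assigned to it."""
    perspective: str  # Finance, Customers, Processes, Learning and development
    description: str
    key_figures: list[KeyFigure]

    def achieved(self) -> bool:
        return all(kf.achievement() >= 1.0 for kf in self.key_figures)


# Invented sample scorecard with one goal in each of two perspectives.
scorecard = [
    Goal("Finance", "Increase profit",
         [KeyFigure("Profit (m)", target=50.0, actual=55.0)]),
    Goal("Processes", "Shorten delivery times",
         [KeyFigure("Delivery time (days)", target=4.0, actual=5.0,
                    higher_is_better=False)]),
]

status = {goal.description: goal.achieved() for goal in scorecard}
print(status)
```

Breaking the scorecard down for subordinate organizational units would then amount to maintaining one such list of goals per unit.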


Scorecard Tools

� Recommendations of the Balanced Scorecard Initiative

� BSC design

– Implementation of a BSC following the BSC approach

� Strategy communication

– Documentation and communication of the BSC elements: Goals, target values, key

figures

� Monitoring of implementation

– Monitoring of measures

� Feedback and adaptation

– Reporting of key figures

– Status of target fulfillment using advanced visualization (see Dashboards)

– Possibility to comment


Big Data in the Press

� In 2011, McKinsey estimated that Big Data can contribute …

� … $300 bn potential annual value to US health care

� … €250 billion potential annual value to Europe’s public sector administration

� J. Manyika et al: Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, May 2011


Big Data Motivation

� processing and analysis of data has traditionally been done in relational databases, using SQL as “inter-galactic data speak”

� in some application areas there is a tremendous increase in data volumes

– RDBMS are not easily able to ingest such data volumes

– Facebook applications create up to several dozen TB of new data each day (!)

– each airplane generates several TB of data on a single flight

� some data “types” are not handled well by RDBMS

– particularly machine-generated data (web logs, sensor data), social media

– those data often are un- or semi-structured or of varying structure

� some analysis and processing styles are not very well supported by RDBMS/SQL

– analysis of semi- and unstructured data

– graph analysis, natural language processing

� data might not be “worth” being stored in a relational database

� The 451 Group: SPRAIN requirements are not very well met by current RDBMS

– SPRAIN: Scalability, Performance, Relaxed consistency, Agility, Intricacy, Necessity

� emergence of Big Data (and NoSQL)


Initial Big Data Technology: Map/Reduce and Hadoop

� Map/Reduce was invented and first implemented and used at Google

� Hadoop is an Apache project implementing a runtime environment for Map/Reduce

� in Hadoop, Map and Reduce functions can be implemented in Java (other languages are supported as well)

[Figure: Hadoop stack with the components HDFS, Map/Reduce, Hive, HBase, Pig]


Map/Reduce and Hadoop

� Map/Reduce is (one of) the most prominent approaches to process Big

Data

� It is a highly scalable approach for processing large data volumes in

parallel

� Map/Reduce can run on clusters consisting of thousands of commodity

servers

� the map phase is executed in parallel and computes sets of key/value

pairs

� the reduce phase (also executed in parallel) combines pairs with equal

keys


Map/Reduce: Example Word Count

� Input (two text fragments)

– “In a traditional database, data (the records in a table) are stored row-wise (i.e., the primary key together with all the other attribute values). This is efficient for requests that retrieve one or only a few rows, require all attributes, or are write-intensive. …”

– “NoSQL database systems are DBMS or other kinds of data management systems that do not offer a SQL interface, or at least not only. In general, NoSQL systems store key/value pairs and the access to data is primarily via the primary key (i.e., no scans, range queries, etc.).”

� Map (one output per fragment)

– database/1, column/5, data/5

– database/2, data/6

� Shuffle

– database/(1,2), column/(5), data/(5,6)

� Reduce (result)

– column/5, data/11, database/3
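
The phases of the word count example (map, shuffle, reduce) can be sketched in plain Python. This is a sequential, single-process illustration under simplifying assumptions, not Hadoop code: the function names and the two short sample documents are invented, and a real runtime would distribute the map and reduce calls across a cluster.

```python
from collections import defaultdict


def map_phase(document: str) -> list[tuple[str, int]]:
    """Emit one (word, 1) pair per word occurrence in a document."""
    words = (w.strip(".,;:()").lower() for w in document.split())
    return [(w, 1) for w in words if w]


def shuffle_phase(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, e.g. data -> [1, 1, 1]."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Invented sample input; each document would be handled by its own map task.
documents = [
    "A database stores data in a column or a row.",
    "NoSQL database systems store data; data access is via the key.",
]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["data"], counts["database"], counts["column"])
```

Unlike the slide, which shows per-fragment counts as map output, this sketch emits one pair per word occurrence and leaves all summing to the reduce phase.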


Big Data Evaluation from a Database Perspective

� Initial Hadoop approaches raise challenges addressed by

(relational) databases long ago for relational query operators

� how to parallelize operations

– communication between phases becomes a critical cost factor

� algorithm design

– application implementors need to design algorithms (or at least evaluate

candidates) from a complexity and cost perspective

� abstraction and declarativeness missing
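
To illustrate why communication between phases matters, the following sketch compares how many key/value pairs a word count would have to move between map and reduce with and without local pre-aggregation on the map side (often called a combiner). The function names and sample documents are hypothetical, and counting pairs is only a rough stand-in for real network cost.

```python
from collections import Counter


def map_without_combiner(document: str) -> list[tuple[str, int]]:
    """Emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in document.lower().split()]


def map_with_combiner(document: str) -> list[tuple[str, int]]:
    """Pre-aggregate locally: emit one (word, count) pair per distinct word."""
    return list(Counter(document.lower().split()).items())


def reduce_counts(pair_lists: list[list[tuple[str, int]]]) -> Counter:
    """Sum the counts per word across all map outputs."""
    totals: Counter = Counter()
    for pairs in pair_lists:
        for word, count in pairs:
            totals[word] += count
    return totals


# Invented sample input.
documents = ["data data data column", "data database data database"]

plain = [map_without_combiner(d) for d in documents]
combined = [map_with_combiner(d) for d in documents]

shuffled_plain = sum(len(pairs) for pairs in plain)
shuffled_combined = sum(len(pairs) for pairs in combined)
print(shuffled_plain, shuffled_combined)
```

Both variants yield the same final counts, but the pre-aggregated one transfers fewer pairs. Deciding on such optimizations is left to the application implementor here, whereas a relational optimizer would make comparable choices for declarative queries.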


DWH and Big Data: Delineation

� first generation Big Data/Hadoop vendors aimed at replacing

data warehouses

– not realistic (and no longer claimed)

� Hadoop & Co as massively scalable and parallel ETL engines

[Figure: Map/Reduce and SQL / BI]


DWH and Big Data: Delineation (2)

� DWH and Big Data as complements for different use cases

� each of them is used for use cases it can handle well (see

below)

� An often overlooked aspect is the typically much better data

quality in data warehouses

� Both together form a data lake

[Figure: SQL / BI and Big Data]

see M. Selvage: Decision Point for Logical Data Warehouse Implementation Style.

Research ID G00250883, Gartner, May 2013


General Big Data Use Cases

� typical use cases

– customer analysis

– product analysis (sentiment analysis, opinion mining)

– graph analysis

– monitoring, analysis, and planning of operations

– security

– legal and compliance

– stock and fund performance predictions

� data used for such use cases

– social media

– log data (web and other logs)

– sensor data

– RFID events

– location data

– weather data (historical and forecast)
