
Source: J. Pei, Data Cleaning and Integration, novel.ict.ac.cn/files/Day 2.pdf



2014-05-06


Data Cleaning and Integration

Data Quality

•  Accuracy
•  Completeness
•  Consistency
•  Timeliness
•  Believability
•  Interpretability

J. Pei: Big Data Analytics -- Data Cleaning and Integration 2

Data Preprocessing

•  Processing data before an analytic task
  –  Improve data quality
  –  Transform data to facilitate the target task

•  Major tasks
  –  Data cleaning
  –  Data integration
  –  Data reduction
  –  Data transformation


Data Cleaning

•  The process of detecting and correcting corrupt or inaccurate records from data

•  Handling missing values
•  Smoothing data


Handling Missing Values

•  Ignore records with missing values
•  Fill in missing values
  –  Manually
  –  Using a global constant
  –  Using a measure of central tendency for the attribute, such as mean, median, or mode
  –  Using the central tendency of the class
  –  Using the most probable value
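The fill-in strategies above can be sketched in plain Python; a minimal sketch with hypothetical data, showing the global-constant and central-tendency options:

```python
from statistics import mean

# Toy records; None marks a missing "age" value (hypothetical data).
ages = [25, 30, None, 40, None, 30]

known = [a for a in ages if a is not None]

# Strategy: fill with a global constant.
by_constant = [a if a is not None else -1 for a in ages]

# Strategy: fill with a measure of central tendency (mean here;
# statistics.median or statistics.mode work the same way).
by_mean = [a if a is not None else mean(known) for a in ages]

print(by_constant)  # [25, 30, -1, 40, -1, 30]
print(by_mean)      # [25, 30, 31.25, 40, 31.25, 30]
```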

Disguised Missing Data?

•  Disguised missing data: missing data entries that are not explicitly represented as missing, but instead appear as potentially valid data values
  –  Information about "State" is missing; "Alabama" is used as the disguise
  –  Example: online forms

Disguised Missing Data Is Misleading

•  Wrong conclusions
•  Unreasonable results

Types of Disguised Missing Data

•  Randomly choose a valid value as disguise

•  A small number of values are chosen as disguise

[Bar charts: number of customers for Alabama, Ohio, and Washington; the real values versus the disguised missing values, which inflate the count for Alabama]

Problem Definition

•  Cleaning disguised missing data
•  Examples
  –  "Alabama" in "state"
  –  "0" in "blood pressure"
  –  "21" in "age"
•  Given a table T with attributes A and an integer k, for each attribute Ai, output k candidates of frequently used disguise values

Ideas

•  Observation 1: frequently used disguises
  –  A small number of values are frequently used as the disguises
•  Observation 2: missing at random
  –  Missing data are often distributed randomly

[Bar chart: number of customers for Alabama, Ohio, and Washington; the records disguised as "Alabama" form a random subset of the whole database]

General Framework

•  For each attribute A
  –  For each frequent value v in A
    •  Compute the maximal embedded unbiased sample contained in Tv
•  Return the k values with the best (in both quality and size) embedded unbiased samples

Id  State    Age  Gender
1   Alabama  30   M
2   Alabama  30   M
3   Alabama  30   F
4   Alabama  20   F
5   Ohio     20   F
6   Ohio     20   F

Smoothing Noisy Data

•  Noise: a random error or variance in a measured variable

•  Smoothing noise – removing noise


Binning


Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34

Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29

Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
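The binning example above can be reproduced in a few lines; a minimal Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries:

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

# Equal-frequency bins of size 3.
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of
# the bin's minimum and maximum.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```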

Regression


Outlier Analysis


Data Cleaning as a Process

•  Data discrepancy detection
  –  Use metadata (e.g., domain, range, dependency, distribution)
  –  Check field overloading
  –  Check the uniqueness rule, consecutive rule, and null rule
  –  Use commercial tools
    •  Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    •  Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
•  Data migration and integration
  –  Data migration tools: allow transformations to be specified
  –  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
•  Integration of the two processes
  –  Iterative and interactive (e.g., Potter's Wheel)

Data Integration

•  Combining data from multiple (autonomous and heterogeneous) sources
•  Providing a unified view
•  Why is data integration hard?
  –  Systems challenges
  –  Data logical organization challenges
  –  Social and administrative challenges

Data Integration System Architecture


http://en.wikipedia.org/wiki/File:Dataintegration.png


Wrappers

•  Computer programs that extract content from a particular data source and transform into a target form, such as a relational table

•  Example: CMS (content management system) wrapper


<html>
<head> <title>%page_title%</title> </head>
<body>
%page_content%
<P>
%page_powered_by%
</body>
</html>

How to Build Wrappers?

•  Manual construction
•  Machine learning based methods: learning schemas from training data
  –  Supervised learning approaches
  –  Unsupervised learning approaches

Schema Matching and Mapping

•  Schema matching: finding the semantic correspondences between attributes in data sources and those in the mediated schema
  –  Example: attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema
  –  Name-based matching
  –  Instance-based matching
•  Schema mapping: transforming attribute values from sources to the mediated schema
  –  Example: a query or a program extracting name values from source S1 and forming firstname and surname values for the mediated schema

Entity Detection and Recognition

•  Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, organizations, etc.
•  Entity disambiguation: distinguish different entities that carry the same name

Example


Data Provenance

•  The data about how a data entry came to be
  –  Also known as data lineage/pedigree
•  The annotation approach: a series of annotations describing how each data item was produced
•  The graph-of-data-relationships approach: connecting sources and deriving new data items via mappings

Deep / Hidden Web

•  Sites that are difficult for a crawler to find
  –  Probably over 100 times larger than the traditionally indexed web
•  Three major categories of sites in the deep web
  –  Private sites: intentionally private; no incoming links, or login may be required
  –  Form results: only accessible by entering data into a form, e.g., airline ticket queries
    •  Hard to detect changes behind a form
  –  Scripted pages: using JavaScript, Flash, or another client-side language in the web page
    •  A crawler needs to execute the script, which can slow down crawling significantly
•  The deep web is different from dynamic pages
  –  Wikis dynamically generate web pages but are easy to crawl
  –  Private sites are static but cannot be crawled

Multidimensional Analysis

Jian Pei: Big Data Analytics -- Multidimensional Analysis 2

Outline

•  Why multidimensional analysis?
•  Multidimensional analysis principles
•  OLAP
•  OLAP indexes

Dimensions

•  "An aspect or feature of a situation, problem, or thing; a measurable extent of some kind" (dictionary)
•  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
  –  Objects are compared in selected dimensions/attributes
•  More often than not, objects have more dimensions/attributes than one is interested in and can handle

Multi-dimensional Analysis

•  Find interesting patterns in multi-dimensional subspaces
  –  "Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)"
•  Different patterns may be manifested in different subspaces
  –  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction, i.e., one set of features for all objects
  –  Different subspaces may manifest different patterns

OLAP

•  Conceptually, we may explore all possible subspaces for interesting patterns
  –  What patterns are interesting?
  –  How can we explore all possible subspaces systematically and efficiently?
  –  These are fundamental problems in analytics and data mining
•  Aggregates and group-bys are frequently used in data analysis and summarization
     SELECT time, altitude, AVG(temp)
     FROM weather
     GROUP BY time, altitude;
  –  In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys 20 times
•  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently

OLAP Operations

•  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
  –  (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
•  Drill down (roll down): reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions

Other Operations

•  Dice: pick specific values or ranges on some dimensions
•  Pivot: "rotate" a cube, changing the order of dimensions in visual analysis

http://en.wikipedia.org/wiki/File:OLAP_pivoting.png


Relational Representation

•  If there are n dimensions, there are 2^n possible aggregation columns
•  Roll up by model, by year, by color in a table

Difficulties

•  Many group-bys are needed
  –  6 dimensions → 2^6 = 64 group-bys
•  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Dummy Value "ALL"

CUBE

SALES
Model  Year  Color  Sales
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  blue     62
Chevy  1991  red      54
Chevy  1991  white    95
Chevy  1991  blue     49
Chevy  1992  red      31
Chevy  1992  white    54
Chevy  1992  blue     71
Ford   1990  red      64
Ford   1990  white    62
Ford   1990  blue     63
Ford   1991  red      52
Ford   1991  white     9
Ford   1991  blue     55
Ford   1992  red      27
Ford   1992  white    62
Ford   1992  blue     39

DATA CUBE
Model  Year  Color  Sales
Chevy  1990  blue     62
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  ALL     154
Chevy  1991  blue     49
Chevy  1991  red      54
Chevy  1991  white    95
Chevy  1991  ALL     198
Chevy  1992  blue     71
Chevy  1992  red      31
Chevy  1992  white    54
Chevy  1992  ALL     156
Chevy  ALL   blue    182
Chevy  ALL   red      90
Chevy  ALL   white   236
Chevy  ALL   ALL     508
Ford   1990  blue     63
Ford   1990  red      64
Ford   1990  white    62
Ford   1990  ALL     189
Ford   1991  blue     55
Ford   1991  red      52
Ford   1991  white     9
Ford   1991  ALL     116
Ford   1992  blue     39
Ford   1992  red      27
Ford   1992  white    62
Ford   1992  ALL     128
Ford   ALL   blue    157
Ford   ALL   red     143
Ford   ALL   white   133
Ford   ALL   ALL     433
ALL    1990  blue    125
ALL    1990  red      69
ALL    1990  white   149
ALL    1990  ALL     343
ALL    1991  blue    104
ALL    1991  red     106
ALL    1991  white   104
ALL    1991  ALL     314
ALL    1992  blue    110
ALL    1992  red      58
ALL    1992  white   116
ALL    1992  ALL     284
ALL    ALL   blue    339
ALL    ALL   red     233
ALL    ALL   white   369
ALL    ALL   ALL     941

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);

Semantics of ALL

•  ALL is a set
  –  Model.ALL = ALL(Model) = {Chevy, Ford}
  –  Year.ALL = ALL(Year) = {1990, 1991, 1992}
  –  Color.ALL = ALL(Color) = {red, white, blue}

OLTP Versus OLAP

                    OLTP                            OLAP
users               clerk, IT professional          knowledge worker
function            day-to-day operations           decision support
DB design           application-oriented            subject-oriented
data                current, up-to-date, detailed,  historical, summarized,
                    flat relational, isolated       multidimensional, integrated, consolidated
usage               repetitive                      ad-hoc
access              read/write, index/hash on       lots of scans
                    primary key
unit of work        short, simple transaction       complex query
# records accessed  tens                            millions
# users             thousands                       hundreds
DB size             100MB-GB                        100GB-TB
metric              transaction throughput          query throughput, response time

What Is a Data Warehouse?

•  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
•  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented

•  Organized around major subjects, such as customer, product, sales

•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

•  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process


Integrated

•  Integrating multiple, heterogeneous data sources
  –  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration
  –  Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    •  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  –  When data is moved to the warehouse, it is converted

Time Variant

•  The time horizon for the data warehouse is significantly longer than that of operational systems
  –  Operational databases: current-value data
  –  Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years)
•  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  –  But the key of operational data may or may not contain a "time element"

Nonvolatile

•  A physically separate store of data transformed from the operational environment

•  Operational updates of data do not occur in the data warehouse environment
  –  Does not require transaction processing, recovery, and concurrency control mechanisms
  –  Requires only two operations in data accessing
    •  Initial loading of data
    •  Access of data

Why Separate Data Warehouse?

•  High performance for both
  –  Operational DBMS: tuned for OLTP
  –  Warehouse: tuned for OLAP
•  Different functions and different data
  –  Historical data: data analysis often uses historical data that operational databases do not typically maintain
  –  Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

Star Schema

Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, state_or_province, country)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Snowflake Schema

Dimension tables (normalized):
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key)
  supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key)
  city (city_key, city, state_or_province, country)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Fact Constellation

Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
  shipper (shipper_key, shipper_name, location_key, shipper_type)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Shipping fact table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped

(Good) Aggregate Functions

•  Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i=1,...,I}) | j=1,...,J})
  –  Examples: COUNT(), MIN(), MAX(), SUM()
  –  G = SUM() for COUNT()
•  Algebraic: there is an M-tuple-valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i=1,...,I}) | j=1,...,J})
  –  Examples: AVG(), standard deviation, MaxN(), MinN()
  –  For AVG(), G() records the sum and the count; H() adds these components across partitions and divides to produce the global average
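The distinction can be made concrete; a minimal Python sketch (hypothetical partition values) computing SUM and COUNT distributively, and AVG algebraically from (sum, count) sub-aggregates:

```python
# Two partitions of the data (hypothetical values).
partitions = [[3, 5, 8], [2, 4]]

# Distributive: the global SUM is G = SUM of the per-partition sums;
# the global COUNT is likewise a SUM of per-partition counts.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)

# Algebraic: AVG via a 2-tuple sub-aggregate G(p) = (sum, count),
# combined by H() = add components, then divide.
g = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in g) / sum(c for _, c in g)

print(total, count, avg)  # 22 5 4.4
```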

Holistic Aggregate Functions

•  There is no constant bound on the size of the storage needed to describe a sub-aggregate
  –  There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i=1,...,I})
•  Examples: Median(), MostFrequent() (also called Mode()), and Rank()

Index Requirements in OLAP

•  Data is read only
  –  (Almost) no insertions or deletions
•  Query types
  –  Point query: look up one specific tuple (rare)
  –  Range query: return the aggregate of a (large) set of tuples, with group-by
  –  Complex queries: need specific algorithms and index structures, discussed later

OLAP Query Example

•  In a table (cust, gender, …), find the total number of male customers
•  Method 1: scan the table once
•  Method 2: build a B+-tree index on the attribute gender; this still needs to access all tuples of male customers
•  Can we get the count without scanning many tuples, perhaps not even all the tuples of male customers?

Bitmap Index

•  For n tuples, a bitmap index has n bits and can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
•  From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

  cust   gender  …     bitmap (gender = M)
  Jack   M       …     1
  Cathy  F       …     0
  …      …       …     …
  Nancy  F       …     0

Using Bitmap to Count

•  shcount[] contains the number of 1-bits in the entry's subscript
  –  shcount[01100101] = 4

  count = 0;
  for (i = 0; i < SHNUM; i++)
      count += shcount[B[i]];
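The counting loop above can be made runnable; a Python sketch that builds the shcount table (one popcount per byte value) and counts a packed bitmap, with hypothetical bytes:

```python
# Precomputed table: shcount[b] = number of 1-bits in byte value b.
shcount = [bin(b).count("1") for b in range(256)]

def bitmap_count(bitmap: bytes) -> int:
    """Count set bits by summing the table entry for each byte."""
    return sum(shcount[b] for b in bitmap)

# A bitmap over 16 rows packed into 2 bytes.
bm = bytes([0b01100101, 0b10000001])
print(shcount[0b01100101])  # 4, as on the slide
print(bitmap_count(bm))     # 6
```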

Advantages of Bitmap Index

•  Efficient in space
•  Ready for logical composition
  –  C = C1 AND C2
  –  Bitmap operations can be used
•  A bitmap index only works for categorical data with low cardinality
  –  Naively, we need 50 bits per entry to represent the state of a customer in the US
  –  How do we represent a sale amount in dollars?

Bit-Sliced Index

•  A sale amount can be written as an integer number of pennies and then represented as a binary number of N bits
  –  24 bits is good for up to $167,772.15, appropriate for many stores
•  A bit-sliced index is N bitmaps
  –  Tuple j sets its bit in bitmap k if the k-th bit of its binary representation is on
  –  The space cost of a bit-sliced index is the same as storing the data directly

Using Indexes

SELECT SUM(sales) FROM Sales WHERE C;
  –  Tuples satisfying C are identified by a bitmap B
•  Direct access to rows to calculate SUM: scan the whole table once
•  B+ tree: find the tuples from the tree
•  Projection index: only scan the attribute sales
•  Bit-sliced index: get the sum from Σk 2^k * COUNT(B AND Bk)
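The bit-sliced sum can be sketched in a few lines; a Python example (hypothetical sales values) using integers as bitmaps, where bit j stands for tuple j:

```python
sales = [5, 3, 7, 2]   # hypothetical sale amounts, one per tuple
N = 3                  # 3 bit slices suffice for values below 8

# Bit-sliced index: slice k has bit j set iff bit k of sales[j] is on.
slices = [sum(((v >> k) & 1) << j for j, v in enumerate(sales))
          for k in range(N)]

# B: bitmap of tuples satisfying the WHERE condition (tuples 0 and 2 here).
B = 0b0101

# SUM over selected tuples = sum over k of 2^k * popcount(B AND B_k).
total = sum((1 << k) * bin(B & slices[k]).count("1") for k in range(N))
print(total)  # 12, i.e., sales[0] + sales[2]
```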

Cost Comparison

•  A traditional value-list index (B+ tree) is costly in both I/O and CPU time
  –  Not good for OLAP
•  A bit-sliced index is efficient in I/O
•  Other case studies in [O'Neil and Quass, SIGMOD'97]

Horizontal or Vertical Storage

•  A fact table for data warehousing is often fat
  –  Tens or even hundreds of dimensions/attributes
•  A query is often about only a few attributes
•  Horizontal storage: tuples are stored one by one
•  Vertical storage: tuples are stored by attribute

  A1  A2  …  A100
  x1  x2  …  x100
  …   …   …  …
  z1  z2  …  z100

Horizontal Versus Vertical

•  Find the information of tuple t
  –  Typical in OLTP
  –  Horizontal storage: get the whole tuple in one search
  –  Vertical storage: search 100 lists
•  Find SUM(a100) GROUP BY {a22, a83}
  –  Typical in OLAP
  –  Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
  –  Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal storage method
•  Projection index: vertical storage

Rolling-up/Drilling-down Analysis

•  Roll up by model, by year, by color
•  The result is not a table: many NULL values, no key
•  Pivot

Extending GROUP BY

SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
         ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
         CUBE Color, Model;

[Diagram: Manufacturer crossed with a (Year, Month, Day) rollup and a (Color, Model) cube]




MOLAP

[Figure: a 3-D array with dimensions Product (TV, VCR, PC), Date (1Qtr-4Qtr), and Country (U.S.A., Canada, Mexico), with sum cells along each face]

Pros and Cons

•  Easy to implement
•  Fast retrieval
•  Many entries may be empty if data is sparse
•  Costly in space

ROLAP – Data Cube in Table

•  A multi-dimensional database

Base table:
  Store  Product  Season  Sales
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9

Cubing produces the data cube (measure AVG(Sales)):
  Store  Product  Season  AVG(Sales)
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9
  S1     *        Spring  9
  …      …        …       …
  *      *        *       9

Observations

•  Once a base table (A, B, C) is sorted by A-B-C, the aggregates (*,*,*), (A,*,*), (A,B,*), and (A,B,C) can be computed with one scan and 4 counters
•  To compute other aggregates, we can sort the base table in other orders
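The one-scan observation can be sketched; a Python example with a hypothetical sorted base table (hash maps stand in for the slide's 4 running counters, which the sorted order would let you flush one group at a time):

```python
from collections import defaultdict

# Base table sorted by A-B-C, with a SUM measure (hypothetical values).
rows = [
    ("a1", "b1", "c1", 4),
    ("a1", "b1", "c2", 6),
    ("a1", "b2", "c1", 5),
    ("a2", "b1", "c1", 3),
]

# One scan maintains aggregates at all four prefix granularities.
levels = [defaultdict(int) for _ in range(4)]
for a, b, c, m in rows:
    levels[3][(a, b, c)] += m          # (A, B, C)
    levels[2][(a, b, "*")] += m        # (A, B, *)
    levels[1][(a, "*", "*")] += m      # (A, *, *)
    levels[0][("*", "*", "*")] += m    # (*, *, *)

print(levels[0][("*", "*", "*")])   # 18
print(levels[1][("a1", "*", "*")])  # 15
```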

How to Sort the Base Table?

•  General sorting in main memory: O(n log n)
•  Counting in main memory: O(n), linear in the number of tuples in the base table
  –  How to sort 1 million integers in the range 1 to 100?
  –  Set up 100 counters, initialized to 0
  –  Scan the integers once, counting the occurrences of each value in 1 to 100
  –  Scan the integers again, putting each integer in its right place
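The counting approach above can be written out; a minimal counting-sort sketch for integers in a bounded range:

```python
def counting_sort(values, lo=1, hi=100):
    """Sort integers in [lo, hi] in O(n) time with one counter per value."""
    counters = [0] * (hi - lo + 1)
    for v in values:                 # first scan: count occurrences
        counters[v - lo] += 1
    out = []
    for v in range(lo, hi + 1):      # emit each value as often as counted
        out.extend([v] * counters[v - lo])
    return out

print(counting_sort([42, 7, 7, 100, 1]))  # [1, 7, 7, 42, 100]
```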

Iceberg Cube

•  In a data cube, many aggregate cells are trivial
  –  They have an aggregate that is too small
•  An iceberg query computes only the cells whose aggregate passes a given threshold

Monotonic Iceberg Condition

•  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
•  For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
  –  (a, b, *) is an ancestor of (a, b, c)
•  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can honor P

Pushing Monotonic Conditions

•  BUC searches the aggregates bottom-up in a depth-first manner
•  The descendants of the current node are expanded only when the monotonic condition holds at that node

How to Push Non-Monotonic Ones?

•  The condition P(c) = AVG(price) >= 800 AND COUNT(*) >= 50 is not monotonic
•  BUC cannot push such a constraint

Ideas

•  Let AVGk(price) be the average of the top-k tuples by price
•  AVGk(price) >= 800 is a monotonic condition
  –  If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
•  AVGk(price) >= 800 can be a filter for AVG(price) >= 800
  –  If AVGk(price) < 800, then AVG(price) < 800
  –  In general, AVG() <= AVGk()
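The top-k average filter can be sketched; a Python example with hypothetical prices, illustrating AVG() <= AVGk():

```python
def avg_top_k(prices, k):
    """AVG_k: average of the k largest values (all values if fewer than k)."""
    top = sorted(prices, reverse=True)[:k]
    return sum(top) / len(top)

cell = [900, 850, 700, 300]   # hypothetical prices of tuples in one cell
print(avg_top_k(cell, 2))     # 875.0

# AVG() <= AVG_k(), so AVG_k(price) < 800 implies AVG(price) < 800.
avg = sum(cell) / len(cell)
print(avg <= avg_top_k(cell, 2))  # True
```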

Minimal Cubing

•  Compute only a "shell" of a data cube
  –  Only compute and materialize low-dimensional cuboids, of dimensionality < k (k << n)
  –  Saves space and cubing time
•  Index the shell cells as well as their covers, i.e., the tuples contributing to the shell cells
•  Query answering
  –  Use the shell cells and their intersections to compute the non-materialized cells

A Data Cube Is Often Huge

•  10 dimensions with cardinality 20 each → 21^10 = 16,679,880,978,201 possible tuples in the cube
•  Even if only 1/1,000 of the possible tuples are non-empty, that is still more than 16 billion tuples

Compression of Data Cubes

•  Traditional compression methods, e.g., zip
  –  High compression ratio
  –  The compressed form cannot be queried directly
•  Requirements for data cube compression
  –  The compressed form can be queried efficiently
  –  High compression ratio
•  Lossless compression and lossy compression

Redundancy in Data Cube

•  Consider a base table with only one tuple (a1, …, a100, 1000) and the aggregate function SUM()
  –  The data cube contains 2^100 tuples!
  –  Every query about SUM() returns 1000
•  A data cube or a sub-cube may be populated by a single tuple (a base single tuple)
•  We do not need to pre-compute and store all aggregates

A Little More General Case

•  Consider a base table with two tuples, t1 = (a1, a2, b3, b4, 100) and t2 = (a1, a2, c3, c4, 1000), and the aggregate function SUM()
•  (a1, a2, *, *), (a1, *, *, *), (*, a2, *, *), and (*, *, *, *) all have sum 1100, since they are all populated by the same group of tuples {t1, t2} (base group tuples)

Semantic Compression

•  Can we summarize a data cube so that the summarization can be browsed and understood effectively?
  –  The summarization itself is a compression
  –  The compression preserves the roll-up/drill-down relation
  –  Directly query-able and browse-able for OLAP
•  Syntactic compression
  –  Does not preserve the roll-up/drill-down semantics
  –  Directly query-able for some queries, but may not be directly browse-able for OLAP

Cube Cell Lattice

•  Observation: many cells may have the same aggregate values
•  Can we summarize the semantics of the cube by grouping cells by aggregate values?

Cells of the cube (base table as in the ROLAP example; measure AVG(Sales)):
  (S1,P1,s):6  (S1,P2,s):12  (S2,P1,f):9
  (S1,*,s):9  (S1,P1,*):6  (*,P1,s):6  (S1,P2,*):12  (*,P2,s):12  (S2,*,f):9  (S2,P1,*):9  (*,P1,f):9
  (S1,*,*):9  (*,*,s):9  (*,P1,*):7.5  (*,P2,*):12  (*,*,f):9  (S2,*,*):9
  (*,*,*):9

A Naïve Attempt

•  Put all cells with the same aggregate value into one class
•  The result is not a lattice anymore!
  –  Anomaly: the roll-up/drill-down semantics is lost

[Figure: the cell lattice partitioned into classes C1, C2, C3, C4 by aggregate value; the induced class graph is no longer a lattice]

A Better Partitioning

•  Quotient cube: a partitioning preserving the roll-up/drill-down semantics

[Figure: the same cube lattice, now partitioned into five classes C1–C5]

Why Semantic Compression Useful?

[Figure: the cube lattice with classes C1–C5 marked, browsed as a reduced lattice]

Why Semantic Compression Useful?

•  OLAP browsing: expanding a single class shows only its cells, e.g., the class covering the base tuple (S2, P1, Fall):

(S2,P1,f):9
(S2,*,f):9   (S2,P1,*):9   (*,P1,f):9
(*,*,f):9   (S2,*,*):9

[Figure: the full cube lattice with classes C1–C5, alongside the expanded class above]

Goals

•  Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
   –  The partition generates a reduced lattice preserving the roll-up/drill-down semantics
   –  The partition is optimal: the number of classes is as small as possible
•  Compute, index and store quotient cubes efficiently to answer OLAP queries

Why Equivalent Aggregate Values?

•  Two cells have equivalent aggregate values if they cover the same set of tuples in the base table

[Figure: the cube lattice, with each cell linked to the tuples in the base table it covers]

Cover Partition

•  For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c
   –  E.g., Cov(S1, *, Spring) = {(S1, P1, Spring), (S1, P2, Spring)}

   Store   Product   Season   | Sales (measure)
   S1      P1        Spring   | 6
   S1      P2        Spring   | 12
   S2      P1        Fall     | 9

Cover Partitions & Aggregates

•  All cells in one cover class carry the same aggregate value with respect to any aggregate function
   –  But cells in one class of MIN() may have different covers
•  For COUNT() and SUM() (over positive measures), cover equivalence coincides with aggregate equivalence
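A minimal sketch of covers and the cover partition on the slide's base table (the helper names `cover` and `cover_partition` are ours, not from the literature):

```python
from collections import defaultdict
from itertools import product

# The base table from the Cover Partition slide: (Store, Product, Season) -> Sales
base = [
    (("S1", "P1", "Spring"), 6),
    (("S1", "P2", "Spring"), 12),
    (("S2", "P1", "Fall"), 9),
]

def cover(cell, base):
    """Base tuples that can be rolled up to 'cell' ('*' matches anything)."""
    return frozenset(
        t for t, _ in base
        if all(c == "*" or c == v for c, v in zip(cell, t))
    )

def cover_partition(base):
    """Group every non-empty cube cell by its cover; all cells in one
    class aggregate the same set of tuples, so they share any aggregate value."""
    classes = defaultdict(set)
    for t, _ in base:
        for mask in product([True, False], repeat=len(t)):
            cell = tuple(v if keep else "*" for v, keep in zip(t, mask))
            classes[cover(cell, base)].add(cell)
    return classes

print(cover(("S1", "*", "Spring"), base))  # the two S1/Spring tuples, as on the slide
classes = cover_partition(base)
print(len(classes))  # 6 cover classes for this tiny cube
```

Grouping by the frozenset of covered tuples is exactly cover equivalence, so for SUM() on positive measures these classes coincide with the aggregate-equivalence classes.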


Quotient Cube

•  A quotient cube is a quotient lattice of the cube lattice such that
   –  Each class is convex and connected
   –  All cells in a class carry the identical aggregate value w.r.t. a given aggregate function
•  A quotient cube preserves the roll-up/drill-down semantics

Multi-Criteria Decision Problems

•  A classic example: fish versus bear's paw (Mencius)
   –  Two dimensions: fish and bear's paw
   –  Preferences: "Fish is what I desire; bear's paw is also what I desire"
   –  Multidimensional decision problems have a long history – more than 2300 years
•  Multidimensional decision problems are often challenging

Skyline – Best Tradeoffs

•  Two dimensions: distance to water and height
•  Skyline: the buildings that are not dominated by any other building in both dimensions

[Figure: a city skyline; SFU Harbor Center]

Skyline: Formal Definition

•  A set of objects S in an n-dimensional space D = (D1, …, Dn)
   –  Numeric dimensions for illustration in this talk
•  For u, v ∈ S, u dominates v if
   –  u is better than v in at least one dimension, and
   –  u is not worse than v in any other dimension
   –  For illustration in this talk, the smaller the better
•  u ∈ S is a skyline object if u is not dominated by any other object in S
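The definition translates directly into code; here is a minimal quadratic sketch (the hotel data is invented for illustration; smaller is better on both dimensions, as on the slides):

```python
def dominates(u, v):
    """u dominates v: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    """Objects not dominated by any other object (naive O(n^2) check)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

hotels = [(100, 4), (80, 5), (120, 3), (80, 6)]  # (price, travel time)
print(skyline(hotels))  # [(100, 4), (80, 5), (120, 3)]; (80, 6) is dominated by (80, 5)
```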


Example

[Figure: points in the (price, travel time) plane; the skyline points are highlighted, and point u dominates point v]

Skyline Computation

•  First investigated as the maximum vector problem in [Kung et al., JACM 1975]
   –  An O(n log^(d-2) n) time algorithm for d ≥ 4, and an O(n log n) time algorithm for d = 2 and 3
   –  Divide-and-conquer-based methods: DD&C, LD&C, FLET
•  Skyline computation in the database context
   –  Data cannot be held in main memory
   –  External algorithms

Skyline Computation on Large DB

•  A rule of thumb in database research – scalability on large databases
•  Index-based methods
   –  Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
   –  Using nearest-neighbor search, by Kossmann et al.
   –  The progressive branch-and-bound method, by Papadias et al.
•  Index-free methods
   –  Divide-and-conquer and block nested loops, by Borzsonyi et al.
   –  Sort-first-skyline (SFS), by Chomicki et al.
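A hedged sketch of the SFS idea (not Chomicki et al.'s exact algorithm): presort by any function that shrinks strictly under dominance, e.g. the coordinate sum, so no point can dominate one that precedes it; a single scan against the skyline found so far then suffices.

```python
def sfs_skyline(points):
    """Sort-first-skyline sketch: if q dominates p (smaller is better),
    then sum(q) < sum(p), so after sorting by coordinate sum no point can
    dominate an earlier one; compare each point only to the running skyline."""
    def dominates(u, v):
        return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))
    result = []
    for p in sorted(points, key=sum):
        if not any(dominates(q, p) for q in result):
            result.append(p)
    return result

print(sfs_skyline([(100, 4), (80, 5), (120, 3), (80, 6)]))
# [(80, 5), (100, 4), (120, 3)]
```

The real SFS is an external algorithm that keeps the running skyline in a bounded window; the sketch above only shows why the presort makes one pass sufficient.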


Full Space Skyline Is Not Enough!

•  Skylines in subspaces
   –  Skyline in space (# stops, price, travel-time)
   –  If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
•  Sky cube – computing skylines in all non-empty subspaces (Yuan et al., VLDB'05)
   –  A database/data warehousing approach
   –  Any subspace skyline query can be answered (efficiently)
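A naive sky-cube sketch (illustrative only, nothing like the efficient algorithms of Yuan et al.): compute the skyline of the projection onto each non-empty subspace. The invented flight data shows why the full-space skyline is not enough: all three flights are full-space skyline points, yet only two survive in the (price, travel-time) subspace.

```python
from itertools import combinations

def dominates(u, v):
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

def sky_cube(points, ndims):
    """Skyline of the projection onto every non-empty subspace
    (duplicate projections are merged before the skyline test)."""
    cube = {}
    for k in range(1, ndims + 1):
        for dims in combinations(range(ndims), k):
            proj = sorted(set(tuple(p[i] for i in dims) for p in points))
            cube[dims] = skyline(proj)
    return cube

flights = [(2, 300, 10), (1, 400, 8), (0, 600, 12)]  # (# stops, price, travel time)
cube = sky_cube(flights, 3)
print(len(cube[(0, 1, 2)]))  # 3: every flight is in the full space skyline
print(cube[(1, 2)])          # [(300, 10), (400, 8)]: (600, 12) drops out
```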


Sky Cube


Understanding Skylines

•  Both Wilt Chamberlain and Michael Jordan are in the full space skyline of the great NBA players
•  Data mining/exploration-driven questions
   –  Which merits, respectively, really make them outstanding?
   –  How are they different?

Redundancy in Sky Cube

Does it just happen that skylines in multiple subspaces are identical?


Mining Decisive Subspaces

•  Decisive subspaces – the minimal combinations of factors that determine the (subspace) skyline membership of an object
•  Examples
   –  Total rebounds for Chamberlain
   –  For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists)
•  Details in [Pei et al., VLDB 2005]
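A simplified sketch of the idea (the full definition in [Pei et al., VLDB 2005] also handles value coincidences; the helper names and data here are invented): enumerate dimension subsets bottom-up and keep the minimal ones in which the point is a subspace-skyline point.

```python
from itertools import combinations

def dominates(u, v):
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def in_subspace_skyline(p, points, dims):
    """Is p's projection onto 'dims' undominated among the other projections?"""
    proj_p = tuple(p[i] for i in dims)
    others = {tuple(q[i] for i in dims) for q in points if q != p}
    return not any(dominates(o, proj_p) for o in others)

def minimal_skyline_subspaces(p, points, ndims):
    """Minimal dimension subsets in which p is a subspace-skyline point."""
    found = []
    for k in range(1, ndims + 1):
        for dims in combinations(range(ndims), k):
            if any(set(f) <= set(dims) for f in found):
                continue  # a strict subset already works, so 'dims' is not minimal
            if in_subspace_skyline(p, points, dims):
                found.append(dims)
    return found

pts = [(1, 5), (3, 1), (2, 2)]  # smaller is better on both dimensions
print(minimal_skyline_subspaces((1, 5), pts, 2))  # [(0,)]: dimension 0 alone decides
print(minimal_skyline_subspaces((2, 2), pts, 2))  # [(0, 1)]: needs both dimensions
```

In the Chamberlain/Jordan illustration on the slide, such minimal subsets play the role of the merits that really make a player outstanding.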


Database & Data Mining Can Meet

•  Conceptually, computing skylines in all subspaces
•  Only computing skyline groups and their decisive subspaces
   –  Concise representation, leading to fast algorithms
   –  [Pei et al., ACM TODS 2006]
•  Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007]

DB Extensions and Applications

•  Improving database query answering
   –  Efficient skyline query answering in subspaces [Tao et al., ICDE 2006]
   –  Effective summary of skylines: the distance-based representative skyline [Tao et al., ICDE 2009]
•  Extensions in data types
   –  Probabilistic skylines on uncertain data [Pei et al., VLDB 2007]
   –  Interval skyline queries on time series [Jiang and Pei, ICDE 2009]

Dynamic User Preferences

•  Different customers may have different preferences

Personalized Recommendations

Favorable Facet Mining

•  A set of points in a multidimensional space
   –  Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
   –  (Categorical) partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
      •  Some templates may apply, e.g., single houses > semi-detached houses
•  Favorable facets of a point p: the partial orders that make p a skyline point

Monotonicity of Partial Orders

•  If p is not in the skyline with respect to a partial order R, then p is not in the skyline with respect to any partial order stronger than R

Minimal Disqualifying Conditions

•  For a point p, a most general partial order that disqualifies p from the skyline is a minimal disqualifying condition (MDC)
•  No partial order stronger than an MDC can make p a skyline point
•  How to compute MDCs efficiently?
   –  MDC-O: computing MDCs on the fly
   –  MDC-M: materializing MDCs
   –  Details in [Wong et al., KDD 2007]

Skyline Warehouse on Preferences

•  Materializing all MDCs and precomputing skylines
   –  Using an Implicit Preference Order tree (IPO-tree) index
•  Can answer skyline queries online with respect to any user preferences
•  Details in [Wong et al., VLDB 2008]

Learning User Preferences

•  Realtors selling realties – a typical multi-criteria decision problem
   –  User preferences on multiple dimensions: location, size, price, style, age, developer, …
   –  Thousands of realties
•  How can a realtor learn a user's preferences on dimensions?
   –  Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in
   –  An interesting realty – a skyline point in the short list
   –  An uninteresting realty – a non-skyline point in the short list

Mining Preferences from Examples

•  Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
   –  Favorable facets are for one superior example only
•  Mining the minimal satisfying preference sets (SPS)
   –  The simplest hypotheses that fit the superior and inferior examples

Learning Methods

•  Complexity
   –  The SPS existence problem is NP-hard
   –  The minimal SPS problem is NP-hard
•  A greedy approach
   –  The term-based greedy algorithm
   –  The condition-based greedy algorithm
   –  Details in [Jiang et al., KDD'08]

Multidimensional Analysis of Logs

•  Look-up: “What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?”

•  Reverse look-up: “What are the group-bys in time and region where Apple iPad was popularly searched for?”

•  Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis


A Topic-Concept Cube Approach


A Successful Case Study
