
Source: J. Pei, Data Cleaning and Integration, novel.ict.ac.cn/files/Day 2.pdf



2014-05-06


Data Cleaning and Integration

Data Quality

•  Accuracy
•  Completeness
•  Consistency
•  Timeliness
•  Believability
•  Interpretability

J. Pei: Big Data Analytics -- Data Cleaning and Integration 2

Data Preprocessing

•  Processing data before an analytic task
  –  Improve data quality
  –  Transform data to facilitate the target task

•  Major tasks
  –  Data cleaning
  –  Data integration
  –  Data reduction
  –  Data transformation


Data Cleaning

•  The process of detecting and correcting corrupt or inaccurate records from data

•  Handling missing values
•  Smoothing data


Handling Missing Values

•  Ignore records with missing values
•  Fill in missing values
  –  Manually
  –  Using a global constant
  –  Using a measure of central tendency for the attribute, such as mean, median, or mode
  –  Using the central tendency of the class
  –  Using the most probable value
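The fill-in strategies above can be sketched in plain Python; a minimal sketch with hypothetical data, showing the global-constant and central-tendency options:

```python
from statistics import mean

# Toy records; None marks a missing "age" value (hypothetical data).
ages = [25, 30, None, 40, None, 30]

known = [a for a in ages if a is not None]

# Strategy: fill with a global constant.
by_constant = [a if a is not None else -1 for a in ages]

# Strategy: fill with a measure of central tendency (mean here;
# statistics.median or statistics.mode work the same way).
by_mean = [a if a is not None else mean(known) for a in ages]

print(by_constant)  # [25, 30, -1, 40, -1, 30]
print(by_mean)      # [25, 30, 31.25, 40, 31.25, 30]
```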

Disguised Missing Data?

•  Disguised missing data: missing data entries that are not explicitly represented as missing, but instead appear as potentially valid data values
  –  Information about "State" is missing; "Alabama" is used as the disguise
  –  Example: online forms

Disguised Missing Data Is Misleading

•  Wrong conclusions
•  Unreasonable results

Types of Disguised Missing Data

•  Randomly choose a valid value as disguise

•  A small number of values are chosen as disguise

[Bar charts: number of customers for Alabama, Ohio, and Washington; the real values versus the disguised missing values, which inflate the count for Alabama]

Problem Definition

•  Cleaning disguised missing data
•  Examples
  –  "Alabama" in "state"
  –  "0" in "blood pressure"
  –  "21" in "age"
•  Given a table T with attributes A and an integer k, for each attribute Ai, output k candidates of frequently used disguise values

Ideas

•  Observation 1: frequently used disguises
  –  A small number of values are frequently used as the disguises
•  Observation 2: missing at random
  –  Missing data are often distributed randomly

[Bar chart: number of customers for Alabama, Ohio, and Washington; the records disguised as "Alabama" form a random subset of the whole database]

General Framework

•  For each attribute A
  –  For each frequent value v in A
    •  Compute the maximal embedded unbiased sample contained in Tv
•  Return the k values with the best (in both quality and size) embedded unbiased samples

Id  State    Age  Gender
1   Alabama  30   M
2   Alabama  30   M
3   Alabama  30   F
4   Alabama  20   F
5   Ohio     20   F
6   Ohio     20   F

Smoothing Noisy Data

•  Noise: a random error or variance in a measured variable

•  Smoothing noise – removing noise


Binning


Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34

Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29

Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
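The binning example above can be reproduced in a few lines; a minimal Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries:

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

# Equal-frequency bins of size 3.
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of
# the bin's minimum and maximum.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```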

Regression


Outlier Analysis


Data Cleaning as a Process

•  Data discrepancy detection
  –  Use metadata (e.g., domain, range, dependency, distribution)
  –  Check field overloading
  –  Check the uniqueness rule, consecutive rule, and null rule
  –  Use commercial tools
    •  Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    •  Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
•  Data migration and integration
  –  Data migration tools: allow transformations to be specified
  –  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
•  Integration of the two processes
  –  Iterative and interactive (e.g., Potter's Wheel)

Data Integration

•  Combining data from multiple (autonomous and heterogeneous) sources
•  Providing a unified view
•  Why is data integration hard?
  –  Systems challenges
  –  Data logical organization challenges
  –  Social and administrative challenges

Data Integration System Architecture


http://en.wikipedia.org/wiki/File:Dataintegration.png


Wrappers

•  Computer programs that extract content from a particular data source and transform into a target form, such as a relational table

•  Example: CMS (content management system) wrapper


<html>
<head> <title>%page_title%</title> </head>
<body>
%page_content%
<P>
%page_powered_by%
</body>
</html>

How to Build Wrappers?

•  Manual construction
•  Machine learning based methods: learning schemas from training data
  –  Supervised learning approaches
  –  Unsupervised learning approaches

Schema Matching and Mapping

•  Schema matching: finding the semantic correspondences between attributes in data sources and those in the mediated schema
  –  Example: attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema
  –  Name-based matching
  –  Instance-based matching
•  Schema mapping: transforming attribute values from sources to the mediated schema
  –  Example: a query or a program extracting name values from source S1 and forming firstname and surname values for the mediated schema

Entity Detection and Recognition

•  Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, organizations, etc.
•  Entity disambiguation: distinguish different entities that carry the same name

Example


Data Provenance

•  The data about how a data entry came to be
  –  Also known as data lineage/pedigree
•  The annotation approach: a series of annotations describing how each data item was produced
•  The graph-of-data-relationships approach: connecting sources and deriving new data items via mappings

Deep / Hidden Web

•  Sites that are difficult for a crawler to find
  –  Probably over 100 times larger than the traditionally indexed web
•  Three major categories of sites in the deep web
  –  Private sites: intentionally private; no incoming links, or login may be required
  –  Form results: only accessible by entering data into a form, e.g., airline ticket queries
    •  Hard to detect changes behind a form
  –  Scripted pages: using JavaScript, Flash, or another client-side language in the web page
    •  A crawler needs to execute the script, which can slow down crawling significantly
•  The deep web is different from dynamic pages
  –  Wikis dynamically generate web pages but are easy to crawl
  –  Private sites are static but cannot be crawled

Multidimensional Analysis

Jian Pei: Big Data Analytics -- Multidimensional Analysis 2

Outline

•  Why multidimensional analysis?
•  Multidimensional analysis principles
•  OLAP
•  OLAP indexes

Dimensions

•  "An aspect or feature of a situation, problem, or thing; a measurable extent of some kind" (dictionary)
•  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
  –  Objects are compared in selected dimensions/attributes
•  More often than not, objects have more dimensions/attributes than one is interested in and can handle

Multi-dimensional Analysis

•  Find interesting patterns in multi-dimensional subspaces
  –  "Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)"
•  Different patterns may be manifested in different subspaces
  –  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction, i.e., one set of features for all objects
  –  Different subspaces may manifest different patterns

OLAP

•  Conceptually, we may explore all possible subspaces for interesting patterns
  –  What patterns are interesting?
  –  How can we explore all possible subspaces systematically and efficiently?
  –  These are fundamental problems in analytics and data mining
•  Aggregates and group-bys are frequently used in data analysis and summarization
     SELECT time, altitude, AVG(temp)
     FROM weather
     GROUP BY time, altitude;
  –  In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys 20 times
•  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently

OLAP Operations

•  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
  –  (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
•  Drill down (roll down): reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions

Other Operations

•  Dice: pick specific values or ranges on some dimensions
•  Pivot: "rotate" a cube, changing the order of dimensions in visual analysis

http://en.wikipedia.org/wiki/File:OLAP_pivoting.png


Relational Representation

•  If there are n dimensions, there are 2^n possible aggregation columns
•  Roll up by model, by year, by color in a table

Difficulties

•  Many group-bys are needed
  –  6 dimensions → 2^6 = 64 group-bys
•  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Dummy Value "ALL"

CUBE

SALES
Model  Year  Color  Sales
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  blue     62
Chevy  1991  red      54
Chevy  1991  white    95
Chevy  1991  blue     49
Chevy  1992  red      31
Chevy  1992  white    54
Chevy  1992  blue     71
Ford   1990  red      64
Ford   1990  white    62
Ford   1990  blue     63
Ford   1991  red      52
Ford   1991  white     9
Ford   1991  blue     55
Ford   1992  red      27
Ford   1992  white    62
Ford   1992  blue     39

DATA CUBE
Model  Year  Color  Sales
Chevy  1990  blue     62
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  ALL     154
Chevy  1991  blue     49
Chevy  1991  red      54
Chevy  1991  white    95
Chevy  1991  ALL     198
Chevy  1992  blue     71
Chevy  1992  red      31
Chevy  1992  white    54
Chevy  1992  ALL     156
Chevy  ALL   blue    182
Chevy  ALL   red      90
Chevy  ALL   white   236
Chevy  ALL   ALL     508
Ford   1990  blue     63
Ford   1990  red      64
Ford   1990  white    62
Ford   1990  ALL     189
Ford   1991  blue     55
Ford   1991  red      52
Ford   1991  white     9
Ford   1991  ALL     116
Ford   1992  blue     39
Ford   1992  red      27
Ford   1992  white    62
Ford   1992  ALL     128
Ford   ALL   blue    157
Ford   ALL   red     143
Ford   ALL   white   133
Ford   ALL   ALL     433
ALL    1990  blue    125
ALL    1990  red      69
ALL    1990  white   149
ALL    1990  ALL     343
ALL    1991  blue    104
ALL    1991  red     106
ALL    1991  white   104
ALL    1991  ALL     314
ALL    1992  blue    110
ALL    1992  red      58
ALL    1992  white   116
ALL    1992  ALL     284
ALL    ALL   blue    339
ALL    ALL   red     233
ALL    ALL   white   369
ALL    ALL   ALL     941

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);

Semantics of ALL

•  ALL is a set
  –  Model.ALL = ALL(Model) = {Chevy, Ford}
  –  Year.ALL = ALL(Year) = {1990, 1991, 1992}
  –  Color.ALL = ALL(Color) = {red, white, blue}

OLTP Versus OLAP

                    OLTP                            OLAP
users               clerk, IT professional          knowledge worker
function            day-to-day operations           decision support
DB design           application-oriented            subject-oriented
data                current, up-to-date, detailed,  historical, summarized,
                    flat relational, isolated       multidimensional, integrated, consolidated
usage               repetitive                      ad-hoc
access              read/write, index/hash on       lots of scans
                    primary key
unit of work        short, simple transaction       complex query
# records accessed  tens                            millions
# users             thousands                       hundreds
DB size             100MB-GB                        100GB-TB
metric              transaction throughput          query throughput, response time

What Is a Data Warehouse?

•  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
•  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented

•  Organized around major subjects, such as customer, product, sales

•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

•  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process


Integrated

•  Integrating multiple, heterogeneous data sources
  –  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration
  –  Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    •  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  –  When data is moved to the warehouse, it is converted

Time Variant

•  The time horizon for the data warehouse is significantly longer than that of operational systems
  –  Operational databases: current-value data
  –  Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years)
•  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  –  But the key of operational data may or may not contain a "time element"

Nonvolatile

•  A physically separate store of data transformed from the operational environment

•  Operational updates of data do not occur in the data warehouse environment
  –  Does not require transaction processing, recovery, and concurrency control mechanisms
  –  Requires only two operations in data accessing
    •  Initial loading of data
    •  Access of data

Why Separate Data Warehouse?

•  High performance for both
  –  Operational DBMS: tuned for OLTP
  –  Warehouse: tuned for OLAP
•  Different functions and different data
  –  Historical data: data analysis often uses historical data that operational databases do not typically maintain
  –  Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

Star Schema

Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, state_or_province, country)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Snowflake Schema

Dimension tables (normalized):
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key)
  supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key)
  city (city_key, city, state_or_province, country)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Fact Constellation

Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
  shipper (shipper_key, shipper_name, location_key, shipper_type)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Shipping fact table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped

(Good) Aggregate Functions

•  Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i=1,...,I}) | j=1,...,J})
  –  Examples: COUNT(), MIN(), MAX(), SUM()
  –  G = SUM() for COUNT()
•  Algebraic: there is an M-tuple-valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i=1,...,I}) | j=1,...,J})
  –  Examples: AVG(), standard deviation, MaxN(), MinN()
  –  For AVG(), G() records the sum and the count; H() adds these components across partitions and divides to produce the global average
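The distinction can be made concrete; a minimal Python sketch (hypothetical partition values) computing SUM and COUNT distributively, and AVG algebraically from (sum, count) sub-aggregates:

```python
# Two partitions of the data (hypothetical values).
partitions = [[3, 5, 8], [2, 4]]

# Distributive: the global SUM is G = SUM of the per-partition sums;
# the global COUNT is likewise a SUM of per-partition counts.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)

# Algebraic: AVG via a 2-tuple sub-aggregate G(p) = (sum, count),
# combined by H() = add components, then divide.
g = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in g) / sum(c for _, c in g)

print(total, count, avg)  # 22 5 4.4
```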

Holistic Aggregate Functions

•  There is no constant bound on the size of the storage needed to describe a sub-aggregate
  –  There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i=1,...,I})
•  Examples: Median(), MostFrequent() (also called Mode()), and Rank()

Index Requirements in OLAP

•  Data is read only
  –  (Almost) no insertions or deletions
•  Query types
  –  Point query: look up one specific tuple (rare)
  –  Range query: return the aggregate of a (large) set of tuples, with group-by
  –  Complex queries: need specific algorithms and index structures, discussed later

OLAP Query Example

•  In a table (cust, gender, …), find the total number of male customers
•  Method 1: scan the table once
•  Method 2: build a B+-tree index on the attribute gender; this still needs to access all tuples of male customers
•  Can we get the count without scanning many tuples, perhaps not even all the tuples of male customers?

Bitmap Index

•  For n tuples, a bitmap index has n bits and can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
•  From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

  cust   gender  …     bitmap (gender = M)
  Jack   M       …     1
  Cathy  F       …     0
  …      …       …     …
  Nancy  F       …     0

Using Bitmap to Count

•  shcount[] contains the number of 1-bits in the entry's subscript
  –  shcount[01100101] = 4

  count = 0;
  for (i = 0; i < SHNUM; i++)
      count += shcount[B[i]];
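The counting loop above can be made runnable; a Python sketch that builds the shcount table (one popcount per byte value) and counts a packed bitmap, with hypothetical bytes:

```python
# Precomputed table: shcount[b] = number of 1-bits in byte value b.
shcount = [bin(b).count("1") for b in range(256)]

def bitmap_count(bitmap: bytes) -> int:
    """Count set bits by summing the table entry for each byte."""
    return sum(shcount[b] for b in bitmap)

# A bitmap over 16 rows packed into 2 bytes.
bm = bytes([0b01100101, 0b10000001])
print(shcount[0b01100101])  # 4, as on the slide
print(bitmap_count(bm))     # 6
```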

Advantages of Bitmap Index

•  Efficient in space
•  Ready for logical composition
  –  C = C1 AND C2
  –  Bitmap operations can be used
•  A bitmap index only works for categorical data with low cardinality
  –  Naively, we need 50 bits per entry to represent the state of a customer in the US
  –  How do we represent a sale amount in dollars?

Bit-Sliced Index

•  A sale amount can be written as an integer number of pennies and then represented as a binary number of N bits
  –  24 bits is good for up to $167,772.15, appropriate for many stores
•  A bit-sliced index is N bitmaps
  –  Tuple j sets its bit in bitmap k if the k-th bit of its binary representation is on
  –  The space cost of a bit-sliced index is the same as storing the data directly

Using Indexes

SELECT SUM(sales) FROM Sales WHERE C;
  –  Tuples satisfying C are identified by a bitmap B
•  Direct access to rows to calculate SUM: scan the whole table once
•  B+ tree: find the tuples from the tree
•  Projection index: only scan the attribute sales
•  Bit-sliced index: get the sum from Σk 2^k * COUNT(B AND Bk)
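The bit-sliced sum can be sketched in a few lines; a Python example (hypothetical sales values) using integers as bitmaps, where bit j stands for tuple j:

```python
sales = [5, 3, 7, 2]   # hypothetical sale amounts, one per tuple
N = 3                  # 3 bit slices suffice for values below 8

# Bit-sliced index: slice k has bit j set iff bit k of sales[j] is on.
slices = [sum(((v >> k) & 1) << j for j, v in enumerate(sales))
          for k in range(N)]

# B: bitmap of tuples satisfying the WHERE condition (tuples 0 and 2 here).
B = 0b0101

# SUM over selected tuples = sum over k of 2^k * popcount(B AND B_k).
total = sum((1 << k) * bin(B & slices[k]).count("1") for k in range(N))
print(total)  # 12, i.e., sales[0] + sales[2]
```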

Cost Comparison

•  A traditional value-list index (B+ tree) is costly in both I/O and CPU time
  –  Not good for OLAP
•  A bit-sliced index is efficient in I/O
•  Other case studies in [O'Neil and Quass, SIGMOD'97]

Horizontal or Vertical Storage

•  A fact table for data warehousing is often fat
  –  Tens or even hundreds of dimensions/attributes
•  A query is often about only a few attributes
•  Horizontal storage: tuples are stored one by one
•  Vertical storage: tuples are stored by attribute

  A1  A2  …  A100
  x1  x2  …  x100
  …   …   …  …
  z1  z2  …  z100

Horizontal Versus Vertical

•  Find the information of tuple t
  –  Typical in OLTP
  –  Horizontal storage: get the whole tuple in one search
  –  Vertical storage: search 100 lists
•  Find SUM(a100) GROUP BY {a22, a83}
  –  Typical in OLAP
  –  Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
  –  Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal storage method
•  Projection index: vertical storage

Rolling-up/Drilling-down Analysis

•  Roll up by model, by year, by color
•  The result is not a table: many NULL values, no key
•  Pivot

Extending GROUP BY

SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
         ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
         CUBE Color, Model;

[Diagram: Manufacturer crossed with a (Year, Month, Day) rollup and a (Color, Model) cube]




MOLAP

[Figure: a 3-D array with dimensions Product (TV, VCR, PC), Date (1Qtr-4Qtr), and Country (U.S.A., Canada, Mexico), with sum cells along each face]

Pros and Cons

•  Easy to implement
•  Fast retrieval
•  Many entries may be empty if data is sparse
•  Costly in space

ROLAP – Data Cube in Table

•  A multi-dimensional database

Base table:
  Store  Product  Season  Sales
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9

Cubing produces the data cube (measure AVG(Sales)):
  Store  Product  Season  AVG(Sales)
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9
  S1     *        Spring  9
  …      …        …       …
  *      *        *       9

Observations

•  Once a base table (A, B, C) is sorted by A-B-C, the aggregates (*,*,*), (A,*,*), (A,B,*), and (A,B,C) can be computed with one scan and 4 counters
•  To compute other aggregates, we can sort the base table in other orders
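The one-scan observation can be sketched; a Python example with a hypothetical sorted base table (hash maps stand in for the slide's 4 running counters, which the sorted order would let you flush one group at a time):

```python
from collections import defaultdict

# Base table sorted by A-B-C, with a SUM measure (hypothetical values).
rows = [
    ("a1", "b1", "c1", 4),
    ("a1", "b1", "c2", 6),
    ("a1", "b2", "c1", 5),
    ("a2", "b1", "c1", 3),
]

# One scan maintains aggregates at all four prefix granularities.
levels = [defaultdict(int) for _ in range(4)]
for a, b, c, m in rows:
    levels[3][(a, b, c)] += m          # (A, B, C)
    levels[2][(a, b, "*")] += m        # (A, B, *)
    levels[1][(a, "*", "*")] += m      # (A, *, *)
    levels[0][("*", "*", "*")] += m    # (*, *, *)

print(levels[0][("*", "*", "*")])   # 18
print(levels[1][("a1", "*", "*")])  # 15
```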

How to Sort the Base Table?

•  General sorting in main memory: O(n log n)
•  Counting in main memory: O(n), linear in the number of tuples in the base table
  –  How to sort 1 million integers in the range 1 to 100?
  –  Set up 100 counters, initialized to 0
  –  Scan the integers once, counting the occurrences of each value in 1 to 100
  –  Scan the integers again, putting each integer in its right place
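The counting approach above can be written out; a minimal counting-sort sketch for integers in a bounded range:

```python
def counting_sort(values, lo=1, hi=100):
    """Sort integers in [lo, hi] in O(n) time with one counter per value."""
    counters = [0] * (hi - lo + 1)
    for v in values:                 # first scan: count occurrences
        counters[v - lo] += 1
    out = []
    for v in range(lo, hi + 1):      # emit each value as often as counted
        out.extend([v] * counters[v - lo])
    return out

print(counting_sort([42, 7, 7, 100, 1]))  # [1, 7, 7, 42, 100]
```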

Iceberg Cube

•  In a data cube, many aggregate cells are trivial
  –  They have an aggregate that is too small
•  An iceberg query computes only the cells whose aggregate passes a given threshold

Monotonic Iceberg Condition

•  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
•  For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
  –  (a, b, *) is an ancestor of (a, b, c)
•  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can honor P

Pushing Monotonic Conditions

•  BUC searches the aggregates bottom-up in a depth-first manner
•  The descendants of the current node are expanded only when the monotonic condition holds at that node

How to Push Non-Monotonic Ones?

•  The condition P(c) = AVG(price) >= 800 AND COUNT(*) >= 50 is not monotonic
•  BUC cannot push such a constraint

Ideas

•  Let AVGk(price) be the average of the top-k tuples by price
•  AVGk(price) >= 800 is a monotonic condition
  –  If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
•  AVGk(price) >= 800 can be a filter for AVG(price) >= 800
  –  If AVGk(price) < 800, then AVG(price) < 800
  –  In general, AVG() <= AVGk()
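The top-k average filter can be sketched; a Python example with hypothetical prices, illustrating AVG() <= AVGk():

```python
def avg_top_k(prices, k):
    """AVG_k: average of the k largest values (all values if fewer than k)."""
    top = sorted(prices, reverse=True)[:k]
    return sum(top) / len(top)

cell = [900, 850, 700, 300]   # hypothetical prices of tuples in one cell
print(avg_top_k(cell, 2))     # 875.0

# AVG() <= AVG_k(), so AVG_k(price) < 800 implies AVG(price) < 800.
avg = sum(cell) / len(cell)
print(avg <= avg_top_k(cell, 2))  # True
```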

Minimal Cubing

•  Compute only a "shell" of a data cube
  –  Only compute and materialize low-dimensional cuboids, of dimensionality < k (k << n)
  –  Saves space and cubing time
•  Index the shell cells as well as their covers, i.e., the tuples contributing to the shell cells
•  Query answering
  –  Use the shell cells and their intersections to compute the non-materialized cells

A Data Cube Is Often Huge

•  10 dimensions with cardinality 20 each → 21^10 = 16,679,880,978,201 possible tuples in the cube
•  Even if only 1/1,000 of the possible tuples are non-empty, that is still more than 16 billion tuples

Compression of Data Cubes

•  Traditional compression methods, e.g., zip
  –  High compression ratio
  –  The compressed form cannot be queried directly
•  Requirements for data cube compression
  –  The compressed form can be queried efficiently
  –  High compression ratio
•  Lossless compression and lossy compression

Redundancy in Data Cube

•  Consider a base table with only one tuple (a1, …, a100, 1000) and the aggregate function SUM()
  –  The data cube contains 2^100 tuples!
  –  Every query about SUM() returns 1000
•  A data cube or a sub-cube may be populated by a single tuple (a base single tuple)
•  We do not need to pre-compute and store all aggregates

A Little More General Case

•  Consider a base table with two tuples, t1 = (a1, a2, b3, b4, 100) and t2 = (a1, a2, c3, c4, 1000), and the aggregate function SUM()
•  (a1, a2, *, *), (a1, *, *, *), (*, a2, *, *), and (*, *, *, *) all have sum 1100, since they are all populated by the same group of tuples {t1, t2} (base group tuples)

Semantic Compression

•  Can we summarize a data cube so that the summarization can be browsed and understood effectively?
  –  The summarization itself is a compression
  –  The compression preserves the roll-up/drill-down relation
  –  Directly query-able and browse-able for OLAP
•  Syntactic compression
  –  Does not preserve the roll-up/drill-down semantics
  –  Directly query-able for some queries, but may not be directly browse-able for OLAP

Cube Cell Lattice

•  Observation: many cells may have the same aggregate values
•  Can we summarize the semantics of the cube by grouping cells by aggregate values?

Cells of the cube (base table as in the ROLAP example; measure AVG(Sales)):
  (S1,P1,s):6  (S1,P2,s):12  (S2,P1,f):9
  (S1,*,s):9  (S1,P1,*):6  (*,P1,s):6  (S1,P2,*):12  (*,P2,s):12  (S2,*,f):9  (S2,P1,*):9  (*,P1,f):9
  (S1,*,*):9  (*,*,s):9  (*,P1,*):7.5  (*,P2,*):12  (*,*,f):9  (S2,*,*):9
  (*,*,*):9

A Naïve Attempt

•  Put all cells with the same aggregate value into one class
•  The result is not a lattice anymore!
  –  Anomaly: the roll-up/drill-down semantics is lost

[Figure: the cell lattice partitioned into classes C1, C2, C3, C4 by aggregate value; the induced class graph is no longer a lattice]

A Better Partitioning

•  Quotient cube: a partitioning preserving the roll-up/drill-down semantics

[Figure: the same cube lattice, now partitioned into five classes C1–C5]

Why Semantic Compression Useful?

[Figure: the cube lattice with classes C1–C5 marked, browsed as a reduced lattice]

Why Semantic Compression Useful?

•  OLAP browsing: expanding a single class shows only its cells, e.g., the class covering the base tuple (S2, P1, Fall):

(S2,P1,f):9
(S2,*,f):9   (S2,P1,*):9   (*,P1,f):9
(*,*,f):9   (S2,*,*):9

[Figure: the full cube lattice with classes C1–C5, alongside the expanded class above]

Goals

•  Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
   –  The partition generates a reduced lattice preserving the roll-up/drill-down semantics
   –  The partition is optimal: the number of classes is as small as possible
•  Compute, index and store quotient cubes efficiently to answer OLAP queries

Why Equivalent Aggregate Values?

•  Two cells have equivalent aggregate values if they cover the same set of tuples in the base table

[Figure: the cube lattice, with each cell linked to the tuples in the base table it covers]

Cover Partition

•  For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c
   –  E.g., Cov(S1, *, Spring) = {(S1, P1, Spring), (S1, P2, Spring)}

   Store   Product   Season   | Sales (measure)
   S1      P1        Spring   | 6
   S1      P2        Spring   | 12
   S2      P1        Fall     | 9

Cover Partitions & Aggregates

•  All cells in one cover class carry the same aggregate value with respect to any aggregate function
   –  But cells in one class of MIN() may have different covers
•  For COUNT() and SUM() (over positive measures), cover equivalence coincides with aggregate equivalence
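A minimal sketch of covers and the cover partition on the slide's base table (the helper names `cover` and `cover_partition` are ours, not from the literature):

```python
from collections import defaultdict
from itertools import product

# The base table from the Cover Partition slide: (Store, Product, Season) -> Sales
base = [
    (("S1", "P1", "Spring"), 6),
    (("S1", "P2", "Spring"), 12),
    (("S2", "P1", "Fall"), 9),
]

def cover(cell, base):
    """Base tuples that can be rolled up to 'cell' ('*' matches anything)."""
    return frozenset(
        t for t, _ in base
        if all(c == "*" or c == v for c, v in zip(cell, t))
    )

def cover_partition(base):
    """Group every non-empty cube cell by its cover; all cells in one
    class aggregate the same set of tuples, so they share any aggregate value."""
    classes = defaultdict(set)
    for t, _ in base:
        for mask in product([True, False], repeat=len(t)):
            cell = tuple(v if keep else "*" for v, keep in zip(t, mask))
            classes[cover(cell, base)].add(cell)
    return classes

print(cover(("S1", "*", "Spring"), base))  # the two S1/Spring tuples, as on the slide
classes = cover_partition(base)
print(len(classes))  # 6 cover classes for this tiny cube
```

Grouping by the frozenset of covered tuples is exactly cover equivalence, so for SUM() on positive measures these classes coincide with the aggregate-equivalence classes.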


Quotient Cube

•  A quotient cube is a quotient lattice of the cube lattice such that
   –  Each class is convex and connected
   –  All cells in a class carry the identical aggregate value w.r.t. a given aggregate function
•  A quotient cube preserves the roll-up/drill-down semantics

Multi-Criteria Decision Problems

•  A classic example: fish versus bear's paw (Mencius)
   –  Two dimensions: fish and bear's paw
   –  Preferences: "Fish is what I desire; bear's paw is also what I desire"
   –  Multidimensional decision problems have a long history – more than 2300 years
•  Multidimensional decision problems are often challenging

Skyline – Best Tradeoffs

•  Two dimensions: distance to water and height
•  Skyline: the buildings that are not dominated by any other building in both dimensions

[Figure: a city skyline; SFU Harbor Center]

Skyline: Formal Definition

•  A set of objects S in an n-dimensional space D = (D1, …, Dn)
   –  Numeric dimensions for illustration in this talk
•  For u, v ∈ S, u dominates v if
   –  u is better than v in at least one dimension, and
   –  u is not worse than v in any other dimension
   –  For illustration in this talk, the smaller the better
•  u ∈ S is a skyline object if u is not dominated by any other object in S
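The definition translates directly into code; here is a minimal quadratic sketch (the hotel data is invented for illustration; smaller is better on both dimensions, as on the slides):

```python
def dominates(u, v):
    """u dominates v: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    """Objects not dominated by any other object (naive O(n^2) check)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

hotels = [(100, 4), (80, 5), (120, 3), (80, 6)]  # (price, travel time)
print(skyline(hotels))  # [(100, 4), (80, 5), (120, 3)]; (80, 6) is dominated by (80, 5)
```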


Example

[Figure: points in the (price, travel time) plane; the skyline points are highlighted, and point u dominates point v]

Skyline Computation

•  First investigated as the maximum vector problem in [Kung et al., JACM 1975]
   –  An O(n log^(d-2) n) time algorithm for d ≥ 4, and an O(n log n) time algorithm for d = 2 and 3
   –  Divide-and-conquer-based methods: DD&C, LD&C, FLET
•  Skyline computation in the database context
   –  Data cannot be held in main memory
   –  External algorithms

Skyline Computation on Large DB

•  A rule of thumb in database research – scalability on large databases
•  Index-based methods
   –  Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
   –  Using nearest-neighbor search, by Kossmann et al.
   –  The progressive branch-and-bound method, by Papadias et al.
•  Index-free methods
   –  Divide-and-conquer and block nested loops, by Borzsonyi et al.
   –  Sort-first-skyline (SFS), by Chomicki et al.
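A hedged sketch of the SFS idea (not Chomicki et al.'s exact algorithm): presort by any function that shrinks strictly under dominance, e.g. the coordinate sum, so no point can dominate one that precedes it; a single scan against the skyline found so far then suffices.

```python
def sfs_skyline(points):
    """Sort-first-skyline sketch: if q dominates p (smaller is better),
    then sum(q) < sum(p), so after sorting by coordinate sum no point can
    dominate an earlier one; compare each point only to the running skyline."""
    def dominates(u, v):
        return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))
    result = []
    for p in sorted(points, key=sum):
        if not any(dominates(q, p) for q in result):
            result.append(p)
    return result

print(sfs_skyline([(100, 4), (80, 5), (120, 3), (80, 6)]))
# [(80, 5), (100, 4), (120, 3)]
```

The real SFS is an external algorithm that keeps the running skyline in a bounded window; the sketch above only shows why the presort makes one pass sufficient.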


Full Space Skyline Is Not Enough!

•  Skylines in subspaces
   –  Skyline in space (# stops, price, travel-time)
   –  If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
•  Sky cube – computing skylines in all non-empty subspaces (Yuan et al., VLDB'05)
   –  A database/data warehousing approach
   –  Any subspace skyline query can be answered (efficiently)
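A naive sky-cube sketch (illustrative only, nothing like the efficient algorithms of Yuan et al.): compute the skyline of the projection onto each non-empty subspace. The invented flight data shows why the full-space skyline is not enough: all three flights are full-space skyline points, yet only two survive in the (price, travel-time) subspace.

```python
from itertools import combinations

def dominates(u, v):
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

def sky_cube(points, ndims):
    """Skyline of the projection onto every non-empty subspace
    (duplicate projections are merged before the skyline test)."""
    cube = {}
    for k in range(1, ndims + 1):
        for dims in combinations(range(ndims), k):
            proj = sorted(set(tuple(p[i] for i in dims) for p in points))
            cube[dims] = skyline(proj)
    return cube

flights = [(2, 300, 10), (1, 400, 8), (0, 600, 12)]  # (# stops, price, travel time)
cube = sky_cube(flights, 3)
print(len(cube[(0, 1, 2)]))  # 3: every flight is in the full space skyline
print(cube[(1, 2)])          # [(300, 10), (400, 8)]: (600, 12) drops out
```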


Sky Cube


Understanding Skylines

•  Both Wilt Chamberlain and Michael Jordan are in the full space skyline of the great NBA players
•  Data mining/exploration-driven questions
   –  Which merits, respectively, really make them outstanding?
   –  How are they different?

Redundancy in Sky Cube

Does it just happen that skylines in multiple subspaces are identical?


Mining Decisive Subspaces

•  Decisive subspaces – the minimal combinations of factors that determine the (subspace) skyline membership of an object
•  Examples
   –  Total rebounds for Chamberlain
   –  For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists)
•  Details in [Pei et al., VLDB 2005]
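A simplified sketch of the idea (the full definition in [Pei et al., VLDB 2005] also handles value coincidences; the helper names and data here are invented): enumerate dimension subsets bottom-up and keep the minimal ones in which the point is a subspace-skyline point.

```python
from itertools import combinations

def dominates(u, v):
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def in_subspace_skyline(p, points, dims):
    """Is p's projection onto 'dims' undominated among the other projections?"""
    proj_p = tuple(p[i] for i in dims)
    others = {tuple(q[i] for i in dims) for q in points if q != p}
    return not any(dominates(o, proj_p) for o in others)

def minimal_skyline_subspaces(p, points, ndims):
    """Minimal dimension subsets in which p is a subspace-skyline point."""
    found = []
    for k in range(1, ndims + 1):
        for dims in combinations(range(ndims), k):
            if any(set(f) <= set(dims) for f in found):
                continue  # a strict subset already works, so 'dims' is not minimal
            if in_subspace_skyline(p, points, dims):
                found.append(dims)
    return found

pts = [(1, 5), (3, 1), (2, 2)]  # smaller is better on both dimensions
print(minimal_skyline_subspaces((1, 5), pts, 2))  # [(0,)]: dimension 0 alone decides
print(minimal_skyline_subspaces((2, 2), pts, 2))  # [(0, 1)]: needs both dimensions
```

In the Chamberlain/Jordan illustration on the slide, such minimal subsets play the role of the merits that really make a player outstanding.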


Database & Data Mining Can Meet

•  Conceptually, computing skylines in all subspaces
•  Only computing skyline groups and their decisive subspaces
   –  Concise representation, leading to fast algorithms
   –  [Pei et al., ACM TODS 2006]
•  Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007]

DB Extensions and Applications

•  Improving database query answering
   –  Efficient skyline query answering in subspaces [Tao et al., ICDE 2006]
   –  Effective summary of skylines: the distance-based representative skyline [Tao et al., ICDE 2009]
•  Extensions in data types
   –  Probabilistic skylines on uncertain data [Pei et al., VLDB 2007]
   –  Interval skyline queries on time series [Jiang and Pei, ICDE 2009]

Dynamic User Preferences

•  Different customers may have different preferences

Personalized Recommendations

Favorable Facet Mining

•  A set of points in a multidimensional space
   –  Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
   –  (Categorical) partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
      •  Some templates may apply, e.g., single houses > semi-detached houses
•  Favorable facets of a point p: the partial orders that make p a skyline point

Monotonicity of Partial Orders

•  If p is not in the skyline with respect to a partial order R, then p is not in the skyline with respect to any partial order stronger than R

Minimal Disqualifying Conditions

•  For a point p, a most general partial order that disqualifies p from the skyline is a minimal disqualifying condition (MDC)
•  No partial order stronger than an MDC can make p a skyline point
•  How to compute MDCs efficiently?
   –  MDC-O: computing MDCs on the fly
   –  MDC-M: materializing MDCs
   –  Details in [Wong et al., KDD 2007]

Skyline Warehouse on Preferences

•  Materializing all MDCs and precomputing skylines
   –  Using an Implicit Preference Order tree (IPO-tree) index
•  Can answer skyline queries online with respect to any user preferences
•  Details in [Wong et al., VLDB 2008]

Learning User Preferences

•  Realtors selling realties – a typical multi-criteria decision problem
   –  User preferences on multiple dimensions: location, size, price, style, age, developer, …
   –  Thousands of realties
•  How can a realtor learn a user's preferences on dimensions?
   –  Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in
   –  An interesting realty – a skyline point in the short list
   –  An uninteresting realty – a non-skyline point in the short list

Mining Preferences from Examples

•  Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
   –  Favorable facets are for one superior example only
•  Mining the minimal satisfying preference sets (SPS)
   –  The simplest hypotheses that fit the superior and inferior examples

Learning Methods

•  Complexity
   –  The SPS existence problem is NP-hard
   –  The minimal SPS problem is NP-hard
•  A greedy approach
   –  The term-based greedy algorithm
   –  The condition-based greedy algorithm
   –  Details in [Jiang et al., KDD'08]

Multidimensional Analysis of Logs

•  Look-up: “What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?”

•  Reverse look-up: “What are the group-bys in time and region where Apple iPad was popularly searched for?”

•  Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis


A Topic-Concept Cube Approach


A Successful Case Study
