2014-05-06
Data Cleaning and Integration
Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
J. Pei: Big Data Analytics -- Data Cleaning and Integration 2
Data Preprocessing
• Processing data before an analytic task
  – Improve data quality
  – Transform data to facilitate the target task
• Major tasks
  – Data cleaning
  – Data integration
  – Data reduction
  – Data transformation
Data Cleaning
• The process of detecting and correcting corrupt or inaccurate records from data
• Handling missing values
• Smoothing data
Handling Missing Values
• Ignore records with missing values
• Fill in missing values
  – Manually
  – Using a global constant
  – Using a measure of central tendency for the attribute, such as mean, median, or mode
  – Using the central tendency of the class
  – Using the most probable value
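The fill-in strategies listed above can be sketched in a few lines of Python. This is only an illustration: the function name `fill_missing`, the use of `None` as the missing marker, and the strategy labels are assumptions made for the example, not part of the slides.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy="mean", constant=None):
    """Replace None entries using a global constant or a central-tendency measure."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant                # global constant chosen by the analyst
    elif strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)           # most frequent observed value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

To use the central tendency of the class instead, one could apply the same function separately to the records of each class.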
Disguised Missing Data?
• Disguised missing data are missing entries that are not explicitly represented as missing, but instead appear as potentially valid data values
  – Information about "State" is missing
  – "Alabama" is used as the disguise
[Figure: online forms]
Disguised Missing Data Is Misleading
• Wrong conclusion
• Unreasonable results
Types of Disguised Missing Data
• Randomly choose a valid value as disguise
• A small number of values are chosen as disguise
[Figure: bar charts of the number of customers (0 to 3500) by state (Alabama, Ohio, Washington), distinguishing real values from disguised missing values]
Problem Definition
• Cleaning disguised missing data
• Examples
  – “Alabama” in “state”
  – “0” in “blood pressure”
  – “21” in “age”
• Given a table T with attributes A and an integer k, for each attribute Ai, output k candidates of frequently used disguise values
Ideas
• Observation 1: Frequently used disguises
  – A small number of values are frequently used as the disguises
• Observation 2: Missing at random
  – Missing data are often distributed randomly
[Figure: number of customers by state (Alabama, Ohio, Washington), annotated "A random subset of the whole database"]
General Framework
• For each attribute A
  – For each frequent value v in A
    • Compute the maximal embedded unbiased sample contained in Tv
  – Return the k values with the best (in both quality and size) embedded unbiased sample
Id State Age Gender
1 Alabama 30 M
2 Alabama 30 M
3 Alabama 30 F
4 Alabama 20 F
5 Ohio 20 F
6 Ohio 20 F
Smoothing Noisy Data
• Noise: a random error or variance in a measured variable
• Smoothing noise – removing noise
Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
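The binning example above can be reproduced with a short Python sketch. The function names are hypothetical, the input is assumed to be already sorted with a length divisible by the number of bins, and ties in boundary smoothing are resolved toward the lower boundary (an assumption; the slide does not say).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size (length must divide evenly)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
```

On the price data, `smooth_by_means(bins)` yields 9, 22, and 29 for the three bins, and `smooth_by_boundaries(bins)` yields the boundary-smoothed bins listed on the slide.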
Regression
Outlier Analysis
Data Cleaning as a Process
• Data discrepancy detection
  – Use metadata (e.g., domain, range, dependency, distribution)
  – Check field overloading
  – Check uniqueness rule, consecutive rule, and null rule
  – Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
• Data auditing: analyze data to discover rules and relationships, and detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  – Data migration tools: allow transformations to be specified
  – ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  – Iterative and interactive (e.g., Potter's Wheel)
Data Integration
• Combining data from multiple (autonomous and heterogeneous) sources
• Providing a unified view
• Why is data integration hard?
  – Systems challenges
  – Data logical organization challenges
  – Social and administrative challenges
Data Integration System Architecture
http://en.wikipedia.org/wiki/File:Dataintegration.png
Wrappers
• Computer programs that extract content from a particular data source and transform it into a target form, such as a relational table
• Example: CMS (content management system) wrapper
<html> <head> <title> %page_title%</title> </head> <body> %page_content% <P> %page_powered_by% </body> </html>
How to Build Wrappers?
• Manual construction
• Machine learning based methods: learning schemas from training data
  – Supervised learning approaches
  – Unsupervised learning approaches
Schema Matching and Mapping
• Schema matching: finding the semantic correspondences between attributes in data sources and those in the mediated schema
  – Example: “attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema”
  – Name based matching
  – Instance based matching
• Schema mapping: transforming attribute values from sources to the mediated schema
  – Example: a query or a program extracting name values from source S1, and forming firstname and surname values for the mediated schema
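As a rough illustration of the schema mapping example, the sketch below turns a record from source S1 into the mediated schema. The record layout (a dict with a `name` key) and the rule "split at the first space" are assumptions made for the example; real name data usually needs more careful handling.

```python
def map_name(record):
    """Hypothetical mapping: split S1's single name attribute into firstname and surname."""
    first, _, last = record["name"].strip().partition(" ")
    return {"firstname": first, "surname": last}

mapped = map_name({"name": "Ada Lovelace"})
```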
Entity Detection and Recognition
• Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, organizations, etc.
• Entity disambiguation: distinguish between different entities carrying the same name
Example
Data Provenance
• The data about how a data entry came to be
  – Also known as data lineage/pedigree
• The annotation approach: a series of annotations describing how each data item was produced
• The graph of data relationships approach: connecting sources and deriving new data items via mapping
Deep / Hidden Web
• Sites that are difficult for a crawler to find
  – Probably over 100 times larger than the traditionally indexed web
• Three major categories of sites in deep web
  – Private sites: intentionally private, with no incoming links, or may require login
  – Form results: only accessible by entering data into a form, e.g., airline ticket queries
    • Hard to detect changes behind a form
  – Scripted pages: using JavaScript, Flash, or another client-side language in the web page
    • A crawler needs to execute the script, which can slow down crawling significantly
• Deep web is different from dynamic pages
  – Wikis dynamically generate web pages but are easy to crawl
  – Private sites are static but cannot be crawled
Multidimensional Analysis
Jian Pei: Big Data Analytics -- Multidimensional Analysis 2
Outline
• Why multidimensional analysis?
• Multidimensional analysis principle
• OLAP
• OLAP indexes
Dimensions
• “An aspect or feature of a situation, problem, or thing, a measurable extent of some kind” – Dictionary
• Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
  – Objects are compared in selected dimensions/attributes
• More often than not, objects have more dimensions/attributes than one is interested in or can handle
Multi-dimensional Analysis
• Find interesting patterns in multi-dimensional subspaces
  – “Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)”
• Different patterns may be manifested in different subspaces
  – Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction, that is, one set of features for all objects
  – Different subspaces may manifest different patterns
OLAP
• Conceptually, we may explore all possible subspaces for interesting patterns
  – What patterns are interesting?
  – How can we explore all possible subspaces systematically and efficiently?
  – Fundamental problems in analytics and data mining
• Aggregates and group-bys are frequently used in data analysis and summarization
  SELECT time, altitude, AVG(temp)
  FROM weather
  GROUP BY time, altitude;
  – In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys are used 20 times
• Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently
OLAP Operations
• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
  – (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
• Drill down (roll down): reverse of roll-up, from higher level summary to lower level summary or detailed data, or introducing new dimensions
Other Operations
• Dice: pick specific values or ranges on some dimensions
• Pivot: “rotate” a cube – changing the order of dimensions in visual analysis
http://en.wikipedia.org/wiki/File:OLAP_pivoting.png
Relational Representation
• If there are n dimensions, there are 2^n possible aggregation columns
Roll up by model by year by color in a table
Difficulties
• Many group-bys are needed
  – 6 dimensions → 2^6 = 64 group-bys
• In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!
Dummy Value “ALL”
CUBE
SALES
Model  Year  Color  Sales
Chevy  1990  red    5
Chevy  1990  white  87
Chevy  1990  blue   62
Chevy  1991  red    54
Chevy  1991  white  95
Chevy  1991  blue   49
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  blue   71
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  blue   63
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  blue   55
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  blue   39

DATA CUBE (result of applying CUBE to SALES)
Model  Year  Color  Sales
Chevy  1990  blue   62
Chevy  1990  red    5
Chevy  1990  white  87
Chevy  1990  ALL    154
Chevy  1991  blue   49
Chevy  1991  red    54
Chevy  1991  white  95
Chevy  1991  ALL    198
Chevy  1992  blue   71
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  ALL    156
Chevy  ALL   blue   182
Chevy  ALL   red    90
Chevy  ALL   white  236
Chevy  ALL   ALL    508
Ford   1990  blue   63
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  ALL    189
Ford   1991  blue   55
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  ALL    116
Ford   1992  blue   39
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  ALL    128
Ford   ALL   blue   157
Ford   ALL   red    143
Ford   ALL   white  133
Ford   ALL   ALL    433
ALL    1990  blue   125
ALL    1990  red    69
ALL    1990  white  149
ALL    1990  ALL    343
ALL    1991  blue   104
ALL    1991  red    106
ALL    1991  white  104
ALL    1991  ALL    314
ALL    1992  blue   110
ALL    1992  red    58
ALL    1992  white  116
ALL    1992  ALL    284
ALL    ALL   blue   339
ALL    ALL   red    233
ALL    ALL   white  369
ALL    ALL   ALL    941

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy') AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
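To make the CUBE semantics concrete, here is a small Python sketch that computes every group-by of the SALES table in one pass, using "ALL" as the dummy value. It is a naive illustration (each row updates all 2^n cells it rolls up to), not how a database engine would implement CUBE; the function name `cube` is an assumption.

```python
from collections import defaultdict
from itertools import product

SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]

def cube(rows, n_dims):
    """Sum the last column over every combination of kept/ALL dimensions."""
    totals = defaultdict(int)
    for *dims, measure in rows:
        # Each mask says which dimensions are replaced by the dummy value ALL.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple("ALL" if hidden else d for d, hidden in zip(dims, mask))
            totals[cell] += measure
    return dict(totals)

data_cube = cube(SALES, 3)
```

With 2 models, 3 years, and 3 colors, the result has (2+1) × (3+1) × (3+1) = 48 cells, and `data_cube[("ALL", "ALL", "ALL")]` is the grand total.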
Semantics of ALL
• ALL is a set
  – Model.ALL = ALL(Model) = {Chevy, Ford}
  – Year.ALL = ALL(Year) = {1990, 1991, 1992}
  – Color.ALL = ALL(Color) = {red, white, blue}
OLTP Versus OLAP
                   | OLTP                                                      | OLAP
users              | clerk, IT professional                                    | knowledge worker
function           | day to day operations                                     | decision support
DB design          | application-oriented                                      | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated  | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                                | ad-hoc
access             | read/write, index/hash on prim. key                       | lots of scans
unit of work       | short, simple transaction                                 | complex query
# records accessed | tens                                                      | millions
# users            | thousands                                                 | hundreds
DB size            | 100MB-GB                                                  | 100GB-TB
metric             | transaction throughput                                    | query throughput, response
What Is a Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.” – W. H. Inmon
• Data warehousing: the process of constructing and using data warehouses
Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Integrated
• Integrating multiple, heterogeneous data sources
  – Relational databases, flat files, on-line transaction records
• Data cleaning and data integration
  – Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., Hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted
Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational databases: current value data
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a “time element”
Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational updates of data do not occur in the data warehouse environment
  – Do not require transaction processing, recovery, and concurrency control mechanisms
  – Require only two operations in data accessing
    • Initial loading of data
    • Access of data
Why Separate Data Warehouse?
• High performance for both
  – Operational DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data
  – Historical data: data analysis often uses historical data that operational databases do not typically maintain
  – Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources
Star Schema
Sales Fact Table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – location (location_key, street, city, state_or_province, country)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
Snowflake Schema
Sales Fact Table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – location (location_key, street, city_key)
    • city (city_key, city, state_or_province, country)
  – item (item_key, item_name, brand, type, supplier_key)
    • supplier (supplier_key, supplier_type)
  – branch (branch_key, branch_name, branch_type)
Fact Constellation
Sales Fact Table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  – Keys: time_key, item_key, shipper_key, from_location, to_location
  – Measures: dollars_cost, units_shipped
Dimension tables
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – location (location_key, street, city, province_or_state, country)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – shipper (shipper_key, shipper_name, location_key, shipper_type)
(Good) Aggregate Functions
• Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i = 1, ..., I}) | j = 1, ..., J})
  – Examples: COUNT(), MIN(), MAX(), SUM()
  – G = SUM() for COUNT()
• Algebraic: there is an M-tuple valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i = 1, ..., I}) | j = 1, ..., J})
  – Examples: AVG(), standard deviation, MaxN(), MinN()
  – For AVG(), G() records sum and count; H() adds up these two components over the sub-aggregates and divides to produce the global average
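The difference between distributive and algebraic functions can be illustrated with a small Python sketch. The names `avg_partial` and `avg_combine` are hypothetical stand-ins for G() and H() in the AVG() case; COUNT() is shown as a distributive function whose combiner is SUM().

```python
def count_combine(partial_counts):
    """Distributive: the global COUNT is the SUM of the partition counts."""
    return sum(partial_counts)

def avg_partial(values):
    """G(): summarize one partition as a 2-tuple (sum, count)."""
    return (sum(values), len(values))

def avg_combine(partials):
    """H(): add up the sums and the counts, then divide."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partitions = [[1, 2, 3], [10]]
global_avg = avg_combine([avg_partial(p) for p in partitions])
```

Note that averaging the partition averages (2 and 10) would give 6, not the true average 4, which is why AVG() needs the 2-tuple sub-aggregate.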
Holistic Aggregate Functions
• There is no constant bound on the size of the storage needed to describe a sub-aggregate
  – There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i = 1, ..., I})
• Examples: Median(), MostFrequent() (also called Mode()), and Rank()
Index Requirements in OLAP
• Data is read only
  – (Almost) no insertion or deletion
• Query types
  – Point query: looking up one specific tuple (rare)
  – Range query: returning the aggregate of a (large) set of tuples, with group by
  – Complex queries: need specific algorithms and index structures, will be discussed later
OLAP Query Example
• In table (cust, gender, …), find the total number of male customers
• Method 1: scan the table once
• Method 2: build a B+ tree index on attribute gender; still need to access all tuples of male customers
• Can we get the count without scanning many tuples, not even all tuples of male customers?
Bitmap Index
• For n tuples, a bitmap index has n bits and can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
• From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j
cust   gender  …  bit
Jack   M       …  1
Cathy  F       …  0
…      …       …  …
Nancy  F       …  0
Using Bitmap to Count
• shcount[] contains the number of 1 bits in the entry subscript
  – shcount[01100101] = 4
  count = 0;
  for (i = 0; i < SHNUM; i++)
      count += shcount[B[i]];
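The same lookup-table idea can be sketched in Python. This version indexes the table by byte (256 entries) for brevity, which is an assumption; the slide's `SHNUM` suggests the original uses short integers. The names `SHCOUNT`, `count_bits`, and `bitmap_and` are made up for the example.

```python
# Precomputed table: SHCOUNT[b] is the number of 1 bits in the byte value b.
SHCOUNT = [bin(b).count("1") for b in range(256)]

def count_bits(bitmap: bytes) -> int:
    """Count the set bits of a bitmap by one table lookup per byte."""
    return sum(SHCOUNT[b] for b in bitmap)

def bitmap_and(b1: bytes, b2: bytes) -> bytes:
    """Combine two equally long bitmaps, e.g. for a condition C = C1 AND C2."""
    return bytes(x & y for x, y in zip(b1, b2))

males = bytes([0b01100101, 0b11111111])
```

Counting the male customers then costs one lookup per byte of the bitmap, without touching any tuple of the base table.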
Advantages of Bitmap Index
• Efficient in space
• Ready for logic composition
  – C = C1 AND C2
  – Bitmap operations can be used
• Bitmap index only works for categorical data with low cardinality
  – Naively, we need 50 bits per entry to represent the state of a customer in US
  – How to represent a sale in dollars?
Bit-Sliced Index
• A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits
  – 24 bits is good for up to $167,772.15, appropriate for many stores
• A bit-sliced index is N bitmaps
  – Bit j is set in bitmap k if the k-th bit in the binary representation of tuple j's value is on
  – The space cost of a bit-sliced index is the same as storing the data directly
Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
  – Tuples satisfying C are identified by a bitmap B
• Direct access to rows to calculate SUM: scan the whole table once
• B+ tree: find the tuples from the tree
• Projection index: only scan attribute sales
• Bit-sliced index: get the sum from Σ_k COUNT(B AND B_k) * 2^k
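The bit-sliced sum can be sketched in Python as follows. For simplicity each bitmap is stored as an arbitrary-precision Python integer in which bit j stands for tuple j; that representation and the function names are assumptions for illustration only.

```python
def build_bit_slices(values, n_bits):
    """Bitmap k has bit j set if the k-th bit of values[j] is on."""
    slices = [0] * n_bits
    for row, value in enumerate(values):
        for k in range(n_bits):
            if (value >> k) & 1:
                slices[k] |= 1 << row
    return slices

def sliced_sum(slices, selection):
    """SUM over the selected rows: sum over k of COUNT(B AND B_k) * 2^k."""
    return sum(bin(selection & s).count("1") << k for k, s in enumerate(slices))

sales = [5, 3, 12, 7]                  # amounts for tuples 0..3
slices = build_bit_slices(sales, 4)
selected = 0b1011                      # condition C holds for tuples 0, 1, and 3
```

Here `sliced_sum(slices, selected)` adds 5 + 3 + 7 using only bitmap ANDs and bit counts, never reading the sales values themselves.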
Cost Comparison
• Traditional value-list index (B+ tree) is costly in both I/O and CPU time
  – Not good for OLAP
• Bit-sliced index is efficient in I/O
• Other case studies in [O'Neil and Quass, SIGMOD'97]
Horizontal or Vertical Storage
• A fact table for data warehousing is often fat
  – Tens or even hundreds of dimensions/attributes
• A query is often about only a few attributes
• Horizontal storage: tuples are stored one by one
• Vertical storage: tuples are stored by attributes
[Figure: a table with attributes A1, A2, …, A100 and tuples (x1, x2, …, x100), …, (z1, z2, …, z100), shown twice to illustrate horizontal and vertical storage]
Horizontal Versus Vertical
• Find the information of tuple t
  – Typical in OLTP
  – Horizontal storage: get the whole tuple in one search
  – Vertical storage: search 100 lists
• Find SUM(a100) GROUP BY {a22, a83}
  – Typical in OLAP
  – Horizontal storage (no index): search all tuples O(100n), where n is the number of tuples
  – Vertical storage: search 3 lists O(3n), 3% of the horizontal storage method
• Projection index: vertical storage
Rolling-up/Drilling-down Analysis
[Figure: roll up by model, by year, by color; the result is not a table, has many NULL values, and has no key]
Pivot
Extending GROUP BY
SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
  ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
  CUBE Color, Model;
[Figure: Manufacturer; Year, Mo, Day; Model x Color cubes]
MOLAP
[Figure: a three-dimensional data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (U.S.A, Canada, Mexico), with a sum cell along each dimension]
Pros and Cons
• Easy to implement
• Fast retrieval
• Many entries may be empty if data is sparse
• Costly in space
ROLAP – Data Cube in Table
• A multi-dimensional database
Base table (dimensions: Store, Product, Season; measure: Sales)
Store  Product  Season  Sales
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9
Cubing result (dimensions: Store, Product, Season; measure: AVG(Sales))
Store  Product  Season  AVG(Sales)
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9
S1     *        Spring  9
…      …        …       …
*      *        *       9
Observations
• Once a base table (A, B, C) is sorted by A-B-C, aggregates (*,*,*), (A,*,*), (A,B,*) and (A,B,C) can be computed with one scan and 4 counters
• To compute other aggregates, we can sort the base table in some other orders
How to Sort the Base Table?
• General sorting in main memory O(n log n)
• Counting in main memory O(n), linear in the number of tuples in the base table
  – How to sort 1 million integers in range 0 to 100?
  – Set up 101 counters, one per value, and initialize them to 0
  – Scan the integers once, count the occurrences of each value in 0 to 100
  – Scan the integers again, put the integers in the right places
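A minimal Python sketch of the counting approach, assuming non-negative integers bounded by a known `max_value`. Because plain integers carry no payload, this version rebuilds the output directly from the counters instead of scanning the input a second time; sorting whole tuples by a key would need the second scan described above.

```python
def counting_sort(values, max_value):
    """Sort integers in the range 0..max_value in O(n + max_value) time."""
    counts = [0] * (max_value + 1)     # one counter per possible value
    for v in values:
        counts[v] += 1                 # single scan to count occurrences
    output = []
    for value, count in enumerate(counts):
        output.extend([value] * count) # emit each value as often as it occurred
    return output

sorted_values = counting_sort([3, 0, 100, 3, 1], 100)
```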
Iceberg Cube
• In a data cube, many aggregate cells are trivial
  – Having an aggregate too small
• Iceberg query
Monotonic Iceberg Condition
• If COUNT(a, b, *)<100, then COUNT(a, b, c)<100 for any c
• For cells c1 and c2, c1 is called an ancestor of c2 if in all dimensions that c1 takes a non-* value, c2 agrees with c1
  – (a,b,*) is an ancestor of (a,b,c)
• An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can satisfy P
Pushing Monotonic Conditions
• BUC searches the aggregates bottom-up in depth-first manner
• The descendants of the current node are expanded only when the monotonic condition holds at that node
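The pruning idea can be sketched with a simplified, in-memory recursion in the spirit of BUC. This is not the published algorithm in full (it omits the sorting and partitioning optimizations); the function name, the "*" marker, and the use of COUNT(*) >= min_count as the monotonic condition are assumptions for the example.

```python
def buc(rows, n_dims, min_count):
    """Return {cell: count} for every cell with COUNT(*) >= min_count."""
    result = {}

    def expand(partition, cell, start):
        result[cell] = len(partition)
        for d in range(start, n_dims):
            groups = {}
            for row in partition:          # partition the tuples on dimension d
                groups.setdefault(row[d], []).append(row)
            for value, group in groups.items():
                # Monotonic pruning: a failing cell has no qualifying descendants.
                if len(group) >= min_count:
                    child = cell[:d] + (value,) + cell[d + 1:]
                    expand(group, child, d + 1)

    if len(rows) >= min_count:
        expand(rows, ("*",) * n_dims, 0)
    return result

rows = [("a", "b", "c"), ("a", "b", "d"), ("a", "e", "c"), ("x", "b", "c")]
iceberg = buc(rows, 3, min_count=2)
```

With `min_count=2`, the cell (a, b, c) has only one supporting tuple, so it is never generated, and neither is anything below (x, *, *).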
How to Push Non-Monotonic Ones?
• Condition P(c)=AVG(price)>=800 AND COUNT(*)>=50 is not monotonic
• BUC cannot push such a constraint
Ideas
• Let AVGk(price) be the average of the top-k tuples (the k tuples with the highest prices)
• AVGk(price)>=800 is a monotonic condition
  – If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
• AVGk(price)>=800 can be a filter for AVG(price)>=800
  – If AVGk(price)<800, AVG(price)<800
  – Generally, AVG()<=AVGk()
Minimal Cubing
• Computing only a “shell” of a data cube
  – Only compute and materialize low dimensional cuboids, dimensionality < k (k << n)
  – Save space and cubing time
• Indexing the shell cells as well as their cover
  – The tuples contributing to the shell cells
• Query answering
  – Using the shell cells and their intersection to compute the non-materialized cells
A Data Cube Is Often Huge
• 10 dimensions, cardinality 20 for each dimension → 21^10 = 16,679,880,978,201 possible tuples in the cube
• Even if only 1/1,000 of the possible tuples are non-empty, there are still more than 16 billion tuples
Compression of Data Cubes
• Traditional compression methods, e.g., zip
  – High compression ratio
  – The compression cannot be queried directly
• Requirements for data cube compression
  – The compression can be queried efficiently
  – High compression ratio
• Lossless compression and lossy compression
Redundancy in Data Cube
• A base table with only one tuple (a1, …, a100, 1000) and aggregate function SUM()
  – The data cube contains 2^100 tuples!
  – Every query about SUM() returns 1000
• A data cube or a sub-cube may be populated by a single tuple: a base single tuple
• We do not need to pre-compute and store all aggregates
A Little More General Case
• A base table with two tuples, t1 = (a1, a2, b3, b4, 100) and t2 = (a1, a2, c3, c4, 1000), aggregate function SUM()
• (a1, a2, *, *), (a1, *, *, *), (*, a2, *, *) and (*, *, *, *) all have sum 1100, since they are populated by the group of tuples {t1, t2} – base group tuples
Semantic Compression
• Can we summarize a data cube so that the summarization can be browsed and understood effectively?
  – The summarization itself is a compression
  – The compression preserves the roll-up/drill-down relation
  – Directly query-able and browse-able for OLAP
• Syntactic compression
  – Not preserving the roll-up/drill-down semantics
  – Directly query-able for some queries, but may not be directly browse-able for OLAP
Cube Cell Lattice
• Observation: many cells may have same aggregate values
• Can we summarize the semantics of the cube by grouping cells by aggregate values?
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
A Naïve Attempt
• Put all cells of same agg values into a class
• The result is not a lattice anymore!
  – Anomaly: the rollup/drilldown semantics is lost
[Figure: the cube cell lattice with cells grouped into classes C1, C2, C3, C4]
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
A Better Partitioning
• Quotient cube: partitioning preserving the rollup/drilldown semantics
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
[Figure: the cube cell lattice with cells grouped into classes C1, C2, C3, C4, C5]
Why Is Semantic Compression Useful?
[Figure: the cube cell lattice with cells grouped into classes C1, C2, C3, C4, C5]
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
Why Is Semantic Compression Useful?
• OLAP browsing
[Figure: the quotient lattice of classes C1, C2, C4, C5, with one class expanded into its cells]
(S2,P1,f):9
(S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(*,*,f):9 (S2,*,*):9
[Figure: the full cube cell lattice]
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
Goals
• Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
  – The partition generates a reduced lattice preserving the roll-up/drill-down semantics
  – The partition is optimal: the number of classes is as small as possible
• Compute, index and store quotient cubes efficiently to answer OLAP queries
Why Equivalent Aggregate Values?
• Two cells have equivalent aggregate values if they cover the same set of tuples in the base table
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*):9 (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
[Figure label: tuples in base table]
Cover Partition
• For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c
  – E.g., Cov(S1,*,spring) = {(S1,P1,spring), (S1,P2,spring)}
Base table (dimensions: Store, Product, Season; measure: Sales)
Store  Product  Season  Sales
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9
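The cover of a cell, and the partition of all cells by cover, can be sketched in Python as below. The sketch enumerates every cell that some base tuple rolls up to, which is exponential in the number of dimensions and only meant for toy inputs; it also assumes the base tuples are distinct. The function names are hypothetical.

```python
from itertools import product

def cover(cell, base):
    """Base tuples that can be rolled up to the cell ('*' matches any value)."""
    return frozenset(
        t for t in base if all(c == "*" or c == v for c, v in zip(cell, t))
    )

def cover_partition(base):
    """Group all non-empty cells into classes that share the same cover."""
    cells = set()
    for t in base:
        for mask in product((False, True), repeat=len(t)):
            cells.add(tuple("*" if hidden else v for v, hidden in zip(t, mask)))
    classes = {}
    for cell in cells:
        classes.setdefault(cover(cell, base), set()).add(cell)
    return classes

base = [("S1", "P1", "Spring"), ("S1", "P2", "Spring"), ("S2", "P1", "Fall")]
classes = cover_partition(base)
```

For the three-tuple base table, the 18 non-empty cells fall into 6 cover classes; for example, (S1,*,Spring), (S1,*,*), and (*,*,Spring) all share the cover {(S1,P1,Spring), (S1,P2,Spring)}.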
Cover Partitions & Aggregates
• All cells in a cover partition carry the same aggregate value with respect to any aggregate function
  – But cells in a class of MIN() may have different covers
• For COUNT() and SUM() (positive), cover equivalence coincides with aggregate equivalence
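A small check of the MIN() remark, using the base table of the cover partition slide. This is a sketch under the assumption that an aggregate is evaluated directly over a cell's cover; the helper names are illustrative.

```python
# Base table: (Store, Product, Season) -> Sales
BASE = {
    ("S1", "P1", "Spring"): 6,
    ("S1", "P2", "Spring"): 12,
    ("S2", "P1", "Fall"): 9,
}

def cover(cell):
    """Base tuples that agree with the cell on every non-'*' dimension."""
    return frozenset(
        t for t in BASE
        if all(c == "*" or c == v for v, c in zip(t, cell))
    )

def aggregate(cell, fn):
    """Apply an aggregate function to the Sales values in the cell's cover."""
    return fn(BASE[t] for t in cover(cell))

a = ("S1", "P1", "*")  # covers only (S1, P1, Spring)
b = ("S1", "*", "*")   # covers both S1 tuples

print(cover(a) == cover(b))                  # False: the covers differ
print(aggregate(a, min), aggregate(b, min))  # 6 6: same MIN() value
print(aggregate(a, sum), aggregate(b, sum))  # 6 18: SUM() separates them
```

The two cells agree under MIN() although their covers differ, while SUM() over positive values distinguishes them.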
Quotient Cube
• A quotient cube is a quotient lattice of the cube lattice such that
  – Each class is convex and connected
  – All cells in a class carry the identical aggregate value w.r.t. a given aggregate function
• Quotient cube preserves the roll-up/drill-down semantics
Multi-Criteria Decision Problems
• [Opening text illegible in the source]
  – Two dimensions: [illegible] and [illegible]
  – Preferences: [illegible]
  – Multidimensional decision problems have a long history: more than 2300 years
• Multidimensional decision problems are often challenging
  – [Remaining items illegible in the source]
Skyline – Best Tradeoffs
• Two dimensions: distance to water and height
• Skyline: the buildings that are not dominated by any other building in both dimensions
[Figure: SFU Harbor Center]
Skyline: Formal Definition
• A set of objects S in an n-dimensional space D = (D1, …, Dn)
  – Numeric dimensions for illustration in this talk
• For u, v ∈ S, u dominates v if
  – u is better than v in at least one dimension, and
  – u is not worse than v in any other dimension
  – For illustration in this talk, the smaller the better
• u ∈ S is a skyline object if u is not dominated by any other object in S
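The definition translates almost directly into code. The sketch below is a naive quadratic check, assuming smaller values are better in every dimension; the sample points (price, travel time) are made up for illustration.

```python
def dominates(u, v):
    """u dominates v: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    """Return the points that are not dominated by any other point (O(n^2))."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (price, travel time) pairs; smaller is better in both.
trips = [(300, 10), (400, 8), (500, 9), (350, 12), (600, 6)]
print(skyline(trips))  # [(300, 10), (400, 8), (600, 6)]
```

Here (500, 9) is dominated by (400, 8), and (350, 12) is dominated by (300, 10); the remaining three points are the best trade-offs.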
Example
[Figure: points plotted by price and travel time; the skyline points are marked, and two points are labeled u and v]
Skyline Computation
• First investigated as the maximum vector problem in [Kung et al. JACM 1975]
  – An O(n log^{d-2} n) time algorithm for d ≥ 4 and an O(n log n) time algorithm for d = 2 and 3
  – Divide-and-conquer-based methods: DD&C, LD&C, FLET
• Skyline computation in database context
  – Data cannot be held in main memory
  – External algorithms
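For d = 2, the O(n log n) bound can be reached with a simple sort-and-scan. The sketch below shows that standard technique and is not claimed to be the algorithm of Kung et al.; it assumes smaller is better in both dimensions and collapses exact duplicate points.

```python
def skyline_2d(points):
    """Skyline of 2-D points in O(n log n): sort once, then scan once."""
    result = []
    best_y = float("inf")
    # After sorting by (x, y), every earlier point has x no larger than the
    # current one, so the current point survives only if its y beats all earlier y.
    for x, y in sorted(set(points)):
        if y < best_y:
            result.append((x, y))
            best_y = y
    return result

# Hypothetical (price, travel time) pairs.
print(skyline_2d([(300, 10), (400, 8), (500, 9), (350, 12), (600, 6)]))
# [(300, 10), (400, 8), (600, 6)]
```

The sort dominates the running time; the scan itself is linear.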
Skyline Computation on Large DB
• A rule of thumb in database research: scalability on large databases
• Index-based methods
  – Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
  – Using nearest-neighbor search, by Kossmann et al.
  – The progressive branch-and-bound method, by Papadias et al.
• Index-free methods
  – Divide-and-conquer and block nested loops, by Borzsonyi et al.
  – Sort-first-skyline (SFS), by Chomicki et al.
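The presorting idea behind sort-first-skyline can be sketched in memory as follows. This is a simplified illustration, not the external-memory algorithm of Chomicki et al.: it assumes smaller is better, uses the coordinate sum as the monotone sorting score, and keeps the whole window in a Python list.

```python
def dominates(u, v):
    """u dominates v: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def sfs_skyline(points):
    """Presort by a monotone score, then filter against the window of skyline points."""
    window = []
    # If u dominates v, then sum(u) < sum(v). After sorting by the sum, a point
    # can only be dominated by points that were processed before it, so every
    # point admitted to the window is a final skyline point.
    for p in sorted(points, key=sum):
        if not any(dominates(s, p) for s in window):
            window.append(p)
    return window

# Hypothetical 3-dimensional points.
pts = [(1, 5, 3), (2, 2, 2), (3, 3, 3), (1, 6, 1), (2, 2, 5)]
print(sfs_skyline(pts))  # [(2, 2, 2), (1, 6, 1), (1, 5, 3)]
```

Because window entries are never evicted, results can be reported progressively, which is one motivation for presorting over a plain block nested loop.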
Full Space Skyline Is Not Enough!
• Skylines in subspaces
  – Skyline in space (# stops, price, travel-time)
  – If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
• Sky cube: computing skylines in all non-empty subspaces (Yuan et al., VLDB'05)
  – A database/data warehousing approach
  – Any subspace skyline queries can be answered (efficiently)
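A sky cube can be illustrated by brute force: compute one skyline per non-empty subset of the dimensions. The sketch below is not the method of Yuan et al., which shares computation across subspaces; the flight data and function names are hypothetical, and smaller is assumed better in every dimension.

```python
from itertools import combinations

def dominates(u, v, dims):
    """u dominates v within the subspace given by the dimension indexes `dims`."""
    return all(u[d] <= v[d] for d in dims) and any(u[d] < v[d] for d in dims)

def subspace_skyline(points, dims):
    return [p for p in points if not any(dominates(q, p, dims) for q in points)]

def sky_cube(points, names):
    """Map every non-empty subspace (a tuple of dimension names) to its skyline."""
    cube = {}
    for k in range(1, len(names) + 1):
        for dims in combinations(range(len(names)), k):
            cube[tuple(names[d] for d in dims)] = subspace_skyline(points, dims)
    return cube

# Hypothetical flights: (# stops, price, travel time in hours).
flights = [(0, 500, 5), (1, 300, 8), (2, 250, 12), (1, 400, 7), (0, 450, 9)]
cube = sky_cube(flights, ("stops", "price", "travel_time"))

print(len(cube))                       # 7 non-empty subspaces
print(cube[("price", "travel_time")])  # (0, 450, 9) drops out once stops are ignored
```

In this toy data every flight is in the full space skyline, yet (0, 450, 9) is not in the (price, travel-time) skyline, so the subspace answer has to be computed or precomputed rather than read off the full space result.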
Sky Cube
Understanding Skylines
• [Text illegible in the source]
• Both Wilt Chamberlain and Michael Jordan are in the full space skyline of the Great NBA Players
• Data mining/exploration-driven questions
  – Which merits, respectively, really make them outstanding?
  – How are they different?
Redundancy in Sky Cube
Does it just happen that skylines in multiple subspaces are identical?
Mining Decisive Subspaces
• Decisive subspaces: the minimal combinations of factors that determine the (subspace) skyline membership of an object
• Examples
  – Total rebounds for Chamberlain
  – For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists)
• Details in [Pei et al., VLDB 2005]
Database & Data Mining Can Meet
• [Text illegible in the source]
• Conceptually, computing skylines in all subspaces
• Only computing skyline groups and their decisive subspaces
  – Concise representation, leading to fast algorithms
  – [Pei et al., ACM TODS 2006]
• Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007]
DB Extensions and Applications
• Improving database query answering
  – Efficient skyline query answering in subspaces [Tao et al., ICDE 2006]
  – Effective summary of skyline: distance-based representative skyline [Tao et al., ICDE 2009]
• Extensions in data types
  – Probabilistic skylines on uncertain data [Pei et al., VLDB 2007]
  – Interval skyline queries on time series [Jiang and Pei, ICDE 2009]
Dynamic User Preferences
Different customers may have different preferences
• [Text illegible in the source]
Personalized Recommendations
• [Text illegible in the source]
Favorable Facet Mining
• [Text illegible in the source]
• A set of points in a multidimensional space
  – Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
  – (Categorical) Partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
    • Some templates may apply, e.g., single houses > semi-detached houses
• Favorable facets of a point p: the partial orders that put p in the skyline
Monotonicity of Partial Orders
If p is not in the skyline with respect to a partial order R, then p is not in the skyline with respect to any partial order stronger than R
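The monotonicity property can be checked on a toy example. The sketch below makes several assumptions that are not in the slides: points are (price, airline) pairs, a partial order is a transitively closed set of (preferred, less preferred) airline pairs, and "stronger" means a superset of those pairs.

```python
def dominates(u, v, better):
    """u dominates v under the airline preferences in `better`.

    u and v are (price, airline); smaller price is better, and (a, b) in
    `better` means airline a is preferred to airline b.
    """
    price_ok = u[0] <= v[0]
    airline_ok = u[1] == v[1] or (u[1], v[1]) in better
    strictly_better = u[0] < v[0] or (u[1], v[1]) in better
    return price_ok and airline_ok and strictly_better

def in_skyline(p, points, better):
    return not any(dominates(q, p, better) for q in points)

# Hypothetical tickets and preference orders.
tickets = [(300, "A"), (400, "B"), (350, "C")]
p = (400, "B")

no_preference = set()
r = {("A", "B")}                        # A is preferred to B
stronger_r = {("A", "B"), ("A", "C")}   # a superset of r

print(in_skyline(p, tickets, no_preference))  # True: airlines are incomparable
print(in_skyline(p, tickets, r))              # False: (300, "A") dominates p
print(in_skyline(p, tickets, stronger_r))     # False: adding pairs cannot undo that
```

Adding preference pairs can only create new dominance relationships, never remove existing ones, which is why a point disqualified under R stays disqualified under any stronger order.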
Minimal Disqualifying Conditions
• [Text illegible in the source]
• For a point p, a most general partial order that disqualifies p from the skyline is a minimal disqualifying condition (MDC)
• No partial order stronger than an MDC can put p in the skyline
• How to compute MDCs efficiently?
  – MDC-O: computing MDCs on the fly
  – MDC-M: materializing MDCs
  – Details in [Wong et al., KDD 2007]
Skyline Warehouse on Preferences
• Materializing all MDCs and precomputing skylines
  – Using an Implicit Preference Order tree (IPO-tree) index
• Can answer skyline queries online with respect to any user preferences
• Details in [Wong et al., VLDB 2008]
Learning User Preferences
• Realtors selling realties: a typical multi-criteria decision problem
  – User preferences on multiple dimensions: location, size, price, style, age, developer, …
  – Thousands of realties
• How can a realtor learn a user's preferences on dimensions?
  – [Text illegible in the source]
  – Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in
  – An interesting realty: a skyline point in the short list
  – An uninteresting realty: a non-skyline point in the short list
Mining Preferences from Examples
• [Text illegible in the source]
• Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
  – Favorable facets are for one superior example only
• Mining the minimal satisfying preference sets (SPS)
  – The simplest hypotheses that fit the superior and inferior examples
Learning Methods
• Complexity
  – The SPS existence problem is NP-hard
  – The minimal SPS problem is NP-hard
• A greedy approach
  – The term-based greedy algorithm
  – The condition-based greedy algorithm
  – Details in [Jiang et al., KDD'08]
Multidimensional Analysis of Logs
• Look-up: “What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?”
• Reverse look-up: “What are the group-bys in time and region where Apple iPad was popularly searched for?”
• Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis
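The two query types can be contrasted on a toy search log. The sketch below is only an illustration under strong assumptions: the log rows and counts are made up, there is a single fixed granularity (month, country) instead of user-specific concept hierarchies, and "popularly searched" is read as "among the top k in its group".

```python
from collections import Counter, defaultdict

# Made-up log rows: (month, country, product, number of searches).
LOG = [
    ("2009-12", "US", "camera", 120),
    ("2009-12", "US", "phone", 200),
    ("2009-12", "US", "tablet", 80),
    ("2009-12", "CA", "tablet", 90),
    ("2009-12", "CA", "camera", 40),
    ("2009-11", "US", "tablet", 150),
    ("2009-11", "US", "phone", 100),
    ("2009-11", "US", "camera", 30),
]

def top_k(month, country, k):
    """Look-up: the k most searched products in one (month, country) group."""
    counts = Counter()
    for m, c, prod, n in LOG:
        if (m, c) == (month, country):
            counts[prod] += n
    return [prod for prod, _ in counts.most_common(k)]

def reverse_lookup(product, k):
    """Reverse look-up: the groups in which the product is among the top k."""
    groups = defaultdict(Counter)
    for m, c, prod, n in LOG:
        groups[(m, c)][prod] += n
    return sorted(
        g for g, counts in groups.items()
        if product in [prod for prod, _ in counts.most_common(k)]
    )

print(top_k("2009-12", "US", 2))    # ['phone', 'camera']
print(reverse_lookup("tablet", 2))  # [('2009-11', 'US'), ('2009-12', 'CA')]
```

A look-up fixes the group and asks for the items, whereas a reverse look-up fixes the item and asks for the groups, so the latter has to inspect every group-by at the chosen granularity.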
A Topic-Concept Cube Approach
A Successful Case Study