Data Warehousing Overview
CS 245 Notes 11
Hector Garcia-Molina
Stanford University

Outline
- What is a data warehouse?
- Why a warehouse?
- Models & operations
- Implementing a warehouse

What is a Warehouse?
- Collection of diverse data
  - subject oriented
  - aimed at executive, decision maker
  - often a copy of operational data
  - with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile

What is a Warehouse?
- Collection of tools
  - gathering data
  - cleansing, integrating, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering warehouse

Warehouse Architecture
[Figure: warehouse architecture diagram; the only recoverable label is "Metadata"]

Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization

Alternative to Warehousing
- Two approaches:
  - Query-Driven (Lazy)
  - Warehouse (Eager)

Query-Driven Approach

Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information

Advantages of Query-Driven
- No need to copy data
  - less storage
  - no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources

Warehouse Models & Operators
- Data Models
  - relational
  - cubes
- Operators

Star

customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la

product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  105      3/8/97  111     p1      c3       5    50

Star Schema

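To make the star layout concrete, here is a minimal sketch that loads the customer, product, store, and sale tables from the slides into an in-memory SQLite database and joins the sale table to the three tables around it. The column types and the choice of SQLite are assumptions made for illustration; the slides do not prescribe a particular DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- sale sits in the middle of the star and references the three tables around it
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT    REFERENCES product(prodId),
    storeId TEXT    REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                            (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('105','3/8/97',111,'p1','c3',5,50);
""")

# A "star join": one join per surrounding table, all through the sale table.
rows = conn.execute("""
    SELECT c.name, p.name, s.city, f.amt
    FROM sale f
    JOIN customer c ON c.custId  = f.custId
    JOIN product  p ON p.prodId  = f.prodId
    JOIN store    s ON s.storeId = f.storeId
    ORDER BY f.amt
""").fetchall()

for row in rows:
    print(row)
```

Each output row pairs one sale amount with descriptive attributes pulled from the surrounding tables.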
Terms
- Fact table
- Dimension tables
- Measures

Dimension Hierarchies

Hierarchy: store -> sType; store -> city -> region

store
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

city
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

region
  regId  name
  north  cold region
  south  warm region

- snowflake schema
- constellations

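As a rough illustration of following a dimension hierarchy, the sketch below walks store -> city -> region using the store, city, and region tables from the slide. Representing the tables as Python dictionaries and the helper name `region_of` are assumptions made for the example.

```python
# Snowflake tables from the slide, keyed by their id columns.
store = {
    "s5": {"cityId": "sfo", "tId": "t1", "mgr": "joe"},
    "s7": {"cityId": "sfo", "tId": "t2", "mgr": "fred"},
    "s9": {"cityId": "la",  "tId": "t1", "mgr": "nancy"},
}
city = {
    "sfo": {"pop": "1M", "regId": "north"},
    "la":  {"pop": "5M", "regId": "south"},
}
region = {"north": "cold region", "south": "warm region"}


def region_of(store_id):
    """Follow store -> city -> region and return the region id."""
    return city[store[store_id]["cityId"]]["regId"]


# Group stores by the region at the top of the hierarchy.
stores_by_region = {}
for store_id in sorted(store):
    stores_by_region.setdefault(region_of(store_id), []).append(store_id)

print(stores_by_region)
```

Any measure recorded per store can then be aggregated per region by grouping on `region_of(store_id)`.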
Cube

Fact table view:

sale
  prodId  storeId  amt
  p1      c1       12
  p2      c1       11
  p1      c3       50
  p2      c2       8

Multi-dimensional cube (dimensions = 2):

        c1   c2   c3
  p1    12        50
  p2    11   8

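The relationship between the two views can be sketched in a few lines: each fact-table row becomes one cell of the cube, and combinations that never occur stay empty. The dictionary-of-dictionaries layout below is just one convenient representation, not something the slides specify.

```python
# Fact table view: (prodId, storeId, amt)
fact = [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)]

products = sorted({p for p, _, _ in fact})
stores = sorted({s for _, s, _ in fact})

# Cube view: one cell per (product, store); None marks an empty cell.
cube = {p: {s: None for s in stores} for p in products}
for prod_id, store_id, amt in fact:
    cube[prod_id][store_id] = amt

print("  ", stores)
for p in products:
    print(p, [cube[p][s] for s in stores])
```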
3-D Cube

Fact table view:

sale
  prodId  storeId  date  amt
  p1      c1       1     12
  p2      c1       1     11
  p1      c3       1     50
  p2      c2       1     8
  p1      c1       2     44
  p1      c2       2     4

Multi-dimensional cube (dimensions = 3):

day 1
        c1   c2   c3
  p1    12        50
  p2    11   8

day 2
        c1   c2   c3
  p1    44   4
  p2

Operators
- Traditional
  - selection
  - aggregation
  - ...
- Analysis
  - clean data
  - find trends
  - ...
[Figure labels: Relational, Cube]

Aggregates
- Add up amounts for day 1
- In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
- Result: 81

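A quick way to check this query is to run it against the six-row sale fact table from the 3-D cube slide. The sketch below assumes SQLite and integer column types, neither of which comes from the slides.

```python
import sqlite3

# (prodId, storeId, date, amt) rows from the 3-D cube fact table
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALE)

# The slide's query: add up amounts for day 1.
(day1_total,) = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()
print(day1_total)  # 12 + 11 + 50 + 8
```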
Aggregates
- Add up amounts by day
- In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example
- Add up amounts by day, product
- In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
- rollup: aggregate further (e.g., from totals by day and product to totals by day)
- drill-down: the reverse, toward finer detail

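The following sketch runs the grouped query on the sale fact table from the 3-D cube slide and then rolls the result up from (date, prodId) to date alone. SQLite and the variable names are assumptions for illustration; the rollup is done in Python here simply to show that the coarser totals can be derived from the finer ones.

```python
import sqlite3

# (prodId, storeId, date, amt) rows from the 3-D cube fact table
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALE)

# Finer level: totals by day and product.
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rollup: collapse the product dimension to get totals by day.
by_day = {}
for date, _prod_id, total in by_day_product:
    by_day[date] = by_day.get(date, 0) + total

print(by_day_product)
print(by_day)
```

Drill-down goes the other way and needs the finer-grained data; it cannot be recovered from `by_day` alone.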
Aggregates
- Operators: sum, count, max, min, median, average
- "Having" clause
- Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)

Cube Aggregation
Example: computing sums

Sum over date (day 1, day 2, ...):

        c1   c2   c3
  p1    56   4    50
  p2    11   8

Then sum over products:

        c1   c2   c3
  sum   67   12   50

Or sum over stores:

        sum
  p1    110
  p2    19

Sum over everything: 129

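The sums on this slide can be reproduced with one small helper that keeps some dimensions and adds up amt over the rest. This is a minimal sketch; the function name `sum_over` and the tuple layout are assumptions, and the data is the sale fact table from the 3-D cube slide.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the 3-D cube fact table
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def sum_over(facts, keep):
    """Sum amt, grouping by the column positions listed in `keep`."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in keep)
        totals[key] += row[3]
    return dict(totals)


by_prod_store = sum_over(SALE, keep=(0, 1))  # date summed away
by_store = sum_over(SALE, keep=(1,))         # date and product summed away
by_prod = sum_over(SALE, keep=(0,))          # date and store summed away
grand_total = sum_over(SALE, keep=())[()]    # everything summed away

print(by_prod_store)
print(by_store, by_prod, grand_total)
```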
Cube Operators
(cube over store, product, date: day 1, day 2, ...)
- sale(c1,*,*) = 67
- sale(c2,p2,*) = 8
- sale(*,*,*) = 129

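The sale(store, product, date) notation, where * means "all values of this dimension", can be sketched as a function with wildcard arguments. The scan-every-row implementation below is an assumption chosen for clarity; the slides do not say how a system evaluates these expressions.

```python
# (storeId, prodId, date, amt), ordered to match the sale(store, product, date) notation
SALE = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]


def sale(store="*", product="*", date="*"):
    """Sum amt over all facts that agree with every non-* argument."""
    total = 0
    for s, p, d, amt in SALE:
        if store in ("*", s) and product in ("*", p) and date in ("*", d):
            total += amt
    return total


print(sale("c1", "*", "*"))
print(sale("c2", "p2", "*"))
print(sale("*", "*", "*"))
```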
Extended Cube
Each dimension gets an extra value * that holds the aggregate over that dimension.

day 1
        c1   c2   c3   *
  p1    12        50   62
  p2    11   8         19
  *     23   8    50   81

day 2
        c1   c2   c3   *
  p1    44   4         48
  p2
  *     44   4         48

day *
        c1   c2   c3   *
  p1    56   4    50   110
  p2    11   8         19
  *     67   12   50   129

Example: sale(*,p2,*) = 19

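One way to picture how the extended cube is filled in: every fact contributes its amount to each cell obtained by replacing any subset of its coordinates with *. The sketch below materializes all such cells for the sale data in a dictionary; this precompute-everything strategy is an assumption for illustration, and whether to materialize every cell is a separate design question.

```python
from collections import defaultdict
from itertools import product as cartesian

# (storeId, prodId, date, amt), matching the sale(store, product, date) notation
SALE = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

extended = defaultdict(int)
for store_id, prod_id, date, amt in SALE:
    # Each coordinate is either kept or replaced by "*": 2**3 cells per fact.
    for cell in cartesian((store_id, "*"), (prod_id, "*"), (date, "*")):
        extended[cell] += amt

print(extended[("*", "p2", "*")])
print(extended[("c1", "*", 1)])
print(extended[("*", "*", "*")])
```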
Aggregation Using Hierarchies
Hierarchy: customer -> region -> country
(customer c1 in Region A; customers c2, c3 in Region B)
[Figure: cube with rows p1, p2 and columns region A, region B]

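A small sketch of aggregating along this hierarchy: the cube's c1, c2, c3 coordinates are read as customers (as the slide does), mapped to their regions, and the amounts are summed per product, region, and day. The amounts are the ones from the 3-D cube fact table; the dictionary `REGION_OF` simply encodes the parenthetical on the slide.

```python
from collections import defaultdict

# Customer -> region level of the hierarchy, as stated on the slide.
REGION_OF = {"c1": "region A", "c2": "region B", "c3": "region B"}

# (prodId, customer, date, amt), reusing the 3-D cube amounts
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Roll the customer dimension up to region.
by_region = defaultdict(int)
for prod_id, customer, date, amt in SALE:
    by_region[(prod_id, REGION_OF[customer], date)] += amt

for key in sorted(by_region):
    print(key, by_region[key])
```

Rolling up further to country would need a region -> country mapping, which the slide does not give.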
Data Analysis
- Decision Trees
- Clustering
- Association Rules

Decision Trees
Example:
- Conducted survey to see which customers were interested in a new model car
- Want to select customers for advertising campaign

Training set:

sale
  custId  car     age  city  newCar
  c1      taurus  27   sf    yes
  c2      van     35   la    yes
  c3      van     40   sf    yes
  c4      taurus  22   sf    yes
  c5      merc    50   la    no
  c6      taurus  25   la    no

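To show what a decision tree over this training set might look like, the sketch below hard-codes one candidate tree and measures how many training rows it classifies correctly. The tree itself (split on age < 30, then on city or car) is a hypothetical choice that happens to fit these six rows; it is not taken from the surviving slide text, and fitting the training set says nothing yet about how reliably it predicts new customers.

```python
# (custId, car, age, city, newCar) rows from the training set
TRAINING = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van",    35, "la", "yes"),
    ("c3", "van",    40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc",   50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]


def predict(car, age, city):
    """Hypothetical two-level tree: test age first, then city or car."""
    if age < 30:
        return "yes" if city == "sf" else "no"
    return "yes" if car == "van" else "no"


correct = sum(
    predict(car, age, city) == label
    for _cust, car, age, city, label in TRAINING
)
accuracy = correct / len(TRAINING)
print(accuracy)
```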
Issues
- Decision tree cannot be too deep
  - would not have statistically significant amounts of data for lower decisions
- Need to select tree that most reliably predicts outcomes

Clustering
[Figure: data points plotted on the axes age, income, education]

Another Example: Text
- Each document is a vector
  - e.g., contains words 1, 4, 5, ...
- Clusters contain similar documents
- Useful for understanding, searching documents
[Figure: clusters labeled international news, sports, business]

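As a rough sketch of "each document is a vector", the example below represents a document by the set of word numbers it contains and compares documents with Jaccard similarity. Both the three sample documents and the choice of Jaccard similarity are assumptions for illustration; the slide does not name a similarity measure.

```python
def jaccard(a, b):
    """Shared words divided by all words appearing in either document."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical documents, each the set of word numbers it contains.
docs = {
    "d1": {1, 4, 5},
    "d2": {1, 4, 6},
    "d3": {2, 3, 7},
}

sim_d1_d2 = jaccard(docs["d1"], docs["d2"])
sim_d1_d3 = jaccard(docs["d1"], docs["d3"])
print(sim_d1_d2, sim_d1_d3)
```

Documents with high pairwise similarity, such as d1 and d2 here, are the ones a clustering step would tend to place together.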
Issues
- Given desired number of clusters?
- Finding "best" clusters
- Are clusters semantically meaningful?
  - e.g., "yuppies" cluster?
- Using clusters for disk storage

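When the desired number of clusters is given, one common heuristic is k-means, sketched below in plain Python. The slides do not name an algorithm, so k-means, the sample (age, income) points, and the starting centroids are all assumptions for illustration. The result depends on the starting centroids, which is one reason finding the "best" clusters is listed as an issue.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, rounds=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for pt in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(pt, centroids[i]))
            clusters[nearest].append(pt)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical (age, income in thousands) points; k = 2 is given up front.
points = [(25, 30), (27, 35), (30, 32), (55, 90), (60, 85), (58, 95)]
clusters, centroids = kmeans(points, centroids=[(25, 30), (55, 90)])
print(clusters)
print(centroids)
```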
Association Rule Mining

Sales records (market-basket data):

  transaction id  customer id  products bought
  tran1           cust33       p2, p5, p8
  tran2           cust45       p5, p8, p11
  tran3           cust12       p1, p9
  tran4           cust40       p5, p8, p11
  tran5           cust12       p2, p9
  tran6           cust12       p9

- Trend: Products p5, p8 often bought together
- Trend: Customer 12 likes product p9

Association Rule
- Rule: {p1, p3, p8}
- Support: number of baskets where these products appear
- High-support set: support >= threshold s
- Problem: find all high-support sets

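The definition of support can be tried out directly on the six market-basket transactions from the sales records. The sketch below counts every itemset in every basket by brute force and keeps those whose support reaches the threshold s. The function name and the value s = 3 are assumptions; enumerating all subsets is exponential in basket size, so this is only workable for tiny examples.

```python
from itertools import combinations

# Products bought in tran1 .. tran6
BASKETS = [
    {"p2", "p5", "p8"},
    {"p5", "p8", "p11"},
    {"p1", "p9"},
    {"p5", "p8", "p11"},
    {"p2", "p9"},
    {"p9"},
]


def high_support_sets(baskets, s):
    """Return every itemset that appears in at least s baskets, with its support."""
    support = {}
    for basket in baskets:
        items = sorted(basket)  # fixed order so each itemset has one tuple form
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                support[itemset] = support.get(itemset, 0) + 1
    return {itemset: n for itemset, n in support.items() if n >= s}


frequent = high_support_sets(BASKETS, s=3)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```

With s = 3 the pair (p5, p8) survives, which matches the "often bought together" trend noted for the sales records.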
Implementation Issues
- ETL (Extraction, transformation, loading)
  - Getting data to the warehouse
  - Entity Resolution
- What to materialize?
- Efficient Analysis
  - Association rule mining
  - ...

ETL: Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application level monitoring
- Each has advantages & disadvantages!

ETL: Data Cleaning
- Migration (e.g., yen to dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
- Auditing: discover rules & relationships (like data mining)
More details: Entity Resolution
  e1:  N: a   A: b   CC#: c   Ph: e
  e2:  N: a   Exp: d   Ph: e
Applications
- comparison shopping
- mailing lists
- classified ads
- customer files
- counter-terrorism
Why is ER Challenging?
- Huge data sets
- No unique identifiers
- Lots of uncertainty
- Many ways to skin the cat
Taxonomy: Pairwise vs Global
- Decide if r, s match only by looking at r, s?
- Or need to consider more (all) records?
- Global matching complicates things a lot!
  - e.g., change decision as new records arrive

  Nm: Pat Smith        Ad: 123 Main St   Ph: (650) 555-1212

  Nm: Patrick Smith    Ad: 132 Main St   Ph: (650) 555-1212
  or
  Nm: Patricia Smith   Ad: 123 Main St   Ph: (650) 777-1111
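A minimal sketch of a purely pairwise decision on the three Smith records may help show why pairwise matching alone can be ambiguous. The match rule below (same last name, plus same phone or same address) is a hypothetical rule chosen for illustration, not one given in the notes, and the record ids r1 to r3 are also assumed.

```python
from itertools import combinations

# The three records from the slide; ids r1..r3 are assumed for illustration.
RECORDS = {
    "r1": {"Nm": "Pat Smith", "Ad": "123 Main St", "Ph": "(650) 555-1212"},
    "r2": {"Nm": "Patrick Smith", "Ad": "132 Main St", "Ph": "(650) 555-1212"},
    "r3": {"Nm": "Patricia Smith", "Ad": "123 Main St", "Ph": "(650) 777-1111"},
}


def pairwise_match(r, s):
    """Hypothetical rule: same last name and (same phone or same address)."""
    same_last_name = r["Nm"].split()[-1] == s["Nm"].split()[-1]
    return same_last_name and (r["Ph"] == s["Ph"] or r["Ad"] == s["Ad"])


def matching_pairs(records):
    """Every pair of record ids that the pairwise rule declares a match."""
    return [
        (a, b)
        for a, b in combinations(sorted(records), 2)
        if pairwise_match(records[a], records[b])
    ]
```

Under this rule Pat Smith matches both Patrick (shared phone) and Patricia (shared address), while Patrick and Patricia do not match each other. Looking only at pairs therefore cannot say which of the two Pat Smith belongs to, which is the situation the global approach tries to address.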
Taxonomy: Outcome
- Partition of records
  - e.g., comparison shopping
- Merged records

  Nm: Pat Smith        Ad: 123 Main St   Ph: (650) 555-1212
  Nm: Patricia Smith   Ad: 123 Main St   Ph: (650) 555-1212, (650) 777-1111   Hair: Black
  Nm: Patricia Smith   Ad: 132 Main St   Ph: (650) 777-1111   Hair: Black
Taxonomy: Outcome
- Iterate after merging

  Nm: Tom      Wk: IBM        Oc: laywer   Sal: 500K
  Nm: Tom      Ad: 123 Main   BD: Jan 1, 85   Wk: IBM
  Nm: Thomas   Ad: 123 Maim   Oc: lawyer
  Nm: Tom      Ad: 123 Main   BD: Jan 1, 85   Wk: IBM   Oc: lawyer
  Nm: Tom      Ad: 123 Main   BD: Jan 1, 85   Wk: IBM   Oc: lawyer   Sal: 500K
Taxonomy: Record Reuse
- One record related to multiple entities?

  Nm: Pat Smith Sr.   Ph: (650) 555-1212
  Ph: (650) 555-1212  Ad: 123 Main St
  Nm: Pat Smith Jr.   Ph: (650) 555-1212
  Nm: Pat Smith Sr.   Ph: (650) 555-1212   Ad: 123 Main St
  Nm: Pat Smith Jr.   Ph: (650) 555-1212   Ad: 123 Main St
Taxonomy: Record Reuse
- Partitions
- Merges
[Figure: records r, s, t shown under partitions and under merges]
- Record reuse complex and expensive!
Taxonomy: Multiple Entity Types
[Figure: entities linked by relationships labeled brother, member, business, member]
[Figure: authors linked to papers; are two authors the same??]
Taxonomy: Exact vs Approximate
[Figure: products split into cameras, CDs, books, ...; ER is run on each group, giving resolved cameras, resolved CDs, resolved books, ...]
[Figure: terrorists sorted by age; record B Cooper, 30 is matched against ages 25-35]
What to Materialize?
- Store in warehouse results useful for common queries
- Example:
[Figure: per-day sales cubes (day 1, day 2, ...) are summed into a total sales aggregate, with 129 as the overall total; the total sales result is what gets materialized]
Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost
Cube Aggregates Lattice

  city, product, date
  city, product     city, date     product, date
  city     product     date
  all  (129)

- Use greedy algorithm to decide what to materialize
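The lattice can be sketched in a few lines by enumerating every subset of the dimensions and summing the sale amounts for each. The six rows below are assumed from the deck's running sale example, with the store code standing in for the slide's city dimension; the helper names `aggregate`, `lattice`, and `roll_up` are illustrative only. The sketch does not implement the greedy selection itself. It only shows why the lattice matters for that choice: a coarser aggregate can be derived from a finer one that is already materialized.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("store", "product", "date")

# (store, product, date, amt): assumed rows from the running sale example.
SALES = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]


def aggregate(rows, dims, group_by):
    """Sum the last column of rows, grouped by the named dimensions."""
    idx = [dims.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def lattice(rows, dims=DIMS):
    """Every group-by in the lattice, from all dimensions down to 'all'."""
    return {
        group_by: aggregate(rows, dims, group_by)
        for k in range(len(dims), -1, -1)
        for group_by in combinations(dims, k)
    }


def roll_up(view, view_dims, group_by):
    """Derive a coarser aggregate from an already materialized finer one."""
    rows = [key + (amt,) for key, amt in view.items()]
    return aggregate(rows, view_dims, group_by)
```

With three dimensions the lattice has eight nodes. The empty group-by `()` plays the role of "all", and on these rows its single total agrees with the 129 shown on the slide.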
Dimension Hierarchies

  all
  state
  city

  cities
  city   state
  c1     CA
  c2     NY
Dimension Hierarchies

  city, product, date
  city, product     city, date     product, date
  city     product     date
  all
  state, product, date
  state, product     state, date
  state

  (not all arcs shown...)
Interesting Hierarchy
[Figure: hierarchy over the levels all, years, quarters, months, weeks, days]

Conceptual dimension table:

  time
  day   week   month   quarter   year
  1     1      1       1         2000
  2     1      1       1         2000
  3     1      1       1         2000
  4     1      1       1         2000
  5     1      1       1         2000
  6     1      1       1         2000
  7     1      1       1         2000
  8     2      1       1         2000
Finding High-Support Pairs
Baskets(basket, item)

  SELECT I.item, J.item, COUNT(I.basket)
  FROM Baskets I, Baskets J
  WHERE I.basket = J.basket AND
        I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;

WHY?
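To see the query run, here is a small sketch using Python's built-in `sqlite3` module on the first six Baskets rows from the example. The table layout follows Baskets(basket, item); the function name `high_support_pairs` and the use of a bound parameter for the threshold s are assumptions made for this illustration.

```python
import sqlite3

# (basket, item) rows from the example table.
ROWS = [
    ("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
    ("t2", "p5"), ("t2", "p8"), ("t2", "p11"),
]

PAIRS_SQL = """
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= ?
"""


def high_support_pairs(rows, s):
    """Run the self-join query and return (item1, item2, support) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
    conn.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)
    result = conn.execute(PAIRS_SQL, (s,)).fetchall()
    conn.close()
    return sorted(result)
```

The condition `I.item < J.item` keeps each unordered pair exactly once: without it the join would also produce every item paired with itself and each pair in both orders. Note that items are compared as text here, so "p11" sorts before "p5" and that pair appears as (p11, p5).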
Example

  Baskets               Pairs within each basket
  basket  item          basket  item1  item2
  t1      p2            t1      p2     p5
  t1      p5            t1      p2     p8
  t1      p8            t1      p5     p8
  t2      p5            t2      p5     p8
  t2      p8            t2      p5     p11
  t2      p11           t2      p8     p11
  ...     ...           ...     ...    ...
Issues
- Performance for size 2 rules
  [Figure: the Baskets(basket, item) table is big; the table of item pairs (basket, item1, item2) is even bigger!]
- Performance for size k rules
Association Rules
- How do we perform rule mining efficiently?
- Observation: If set X has support t, then each subset of X must have support at least t
- For 2-sets:
  - if we need support s for {i, j}
  - then each of i, j must appear in at least s baskets
Algorithm for 2-Sets
(1) Find OK products
    - those appearing in s or more baskets
(2) Find high-support pairs using only OK products
Algorithm for 2-Sets
Keep only the rows of OK products:

  INSERT INTO okBaskets(basket, item)
    SELECT basket, item
    FROM Baskets
    WHERE item IN (SELECT item
                   FROM Baskets
                   GROUP BY item
                   HAVING COUNT(basket) >= s);

Perform mining on okBaskets:

  SELECT I.item, J.item, COUNT(I.basket)
  FROM okBaskets I, okBaskets J
  WHERE I.basket = J.basket AND
        I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;
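The two-step algorithm can be tried end to end with a short `sqlite3` sketch. It assumes the six market-basket transactions from the earlier sales-records example, renamed t1 to t6, and it writes step 1 with an IN subquery so that every (basket, item) row of an OK product is kept. The function name `pairs_with_prefilter` is hypothetical.

```python
import sqlite3

# (basket, item) rows for the six example transactions, renamed t1..t6.
ROWS = [
    ("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
    ("t2", "p5"), ("t2", "p8"), ("t2", "p11"),
    ("t3", "p1"), ("t3", "p9"),
    ("t4", "p5"), ("t4", "p8"), ("t4", "p11"),
    ("t5", "p2"), ("t5", "p9"),
    ("t6", "p9"),
]


def pairs_with_prefilter(rows, s):
    """Return (rows kept in okBaskets, high-support pairs found on them)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
    conn.execute("CREATE TABLE okBaskets (basket TEXT, item TEXT)")
    conn.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)
    # Step 1: keep only rows whose item appears in s or more baskets.
    conn.execute(
        """
        INSERT INTO okBaskets(basket, item)
        SELECT basket, item FROM Baskets
        WHERE item IN (SELECT item FROM Baskets
                       GROUP BY item HAVING COUNT(basket) >= ?)
        """,
        (s,),
    )
    kept = conn.execute("SELECT COUNT(*) FROM okBaskets").fetchone()[0]
    # Step 2: the pair query, now over the smaller okBaskets table.
    pairs = conn.execute(
        """
        SELECT I.item, J.item, COUNT(I.basket)
        FROM okBaskets I, okBaskets J
        WHERE I.basket = J.basket AND I.item < J.item
        GROUP BY I.item, J.item
        HAVING COUNT(I.basket) >= ?
        """,
        (s,),
    ).fetchall()
    conn.close()
    return kept, sorted(pairs)
```

With s = 3 only p5, p8, and p9 are OK products, so the self-join in step 2 runs over 9 of the 14 original rows.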
Counting Efficiently
One way (threshold = 3):

  basket  I.item  J.item
  t1      p5      p8
  t2      p5      p8
  t2      p8      p11
  t3      p2      p3
  t3      p5      p8
  t3      p2      p8
  ...     ...     ...
Counting Efficiently
Another way (threshold = 3):
[Figure: counting over the (basket, I.item, J.item) pair table]
Yet Another Way
threshold = 3
[Figure: counting over the (basket, I.item, J.item) pair table; one candidate is marked as a false positive]
Discussion
- Hashing scheme: 2 (or 3) scans of data
- Sorting scheme: requires a sort!
- Hashing works well if few high-support pairs and many low-support ones
  - iceberg queries
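The two schemes named in the discussion can be sketched as follows. This is one plausible reading, since the slide figures did not survive: the sorting scheme sorts the pairs and counts runs of equal pairs, and the hashing scheme uses a first scan to fill a small array of bucket counters and a second scan to count exactly only the pairs that hash to a heavy bucket. The bucket count, the CRC-based hash, and the function names are assumptions for illustration.

```python
import zlib
from collections import Counter
from itertools import groupby

# (basket, I.item, J.item) rows from the pair table on the slides.
PAIRS = [
    ("t1", "p5", "p8"),
    ("t2", "p5", "p8"),
    ("t2", "p8", "p11"),
    ("t3", "p2", "p3"),
    ("t3", "p5", "p8"),
    ("t3", "p2", "p8"),
]


def count_by_sorting(pairs, threshold):
    """Sort the pairs so equal ones are adjacent, then count each run."""
    ordered = sorted((i, j) for _, i, j in pairs)
    result = {}
    for pair, run in groupby(ordered):
        n = sum(1 for _ in run)
        if n >= threshold:
            result[pair] = n
    return result


def bucket_of(pair, num_buckets):
    """Deterministic hash of a pair into one of num_buckets counters."""
    return zlib.crc32("|".join(pair).encode()) % num_buckets


def count_by_hashing(pairs, threshold, num_buckets=4):
    """Two scans: bucket counts first, exact counts for candidates second."""
    # Scan 1: one counter per bucket instead of one counter per pair.
    buckets = [0] * num_buckets
    for _, i, j in pairs:
        buckets[bucket_of((i, j), num_buckets)] += 1
    # Scan 2: a pair in a heavy bucket is only a candidate; a low-support
    # pair sharing that bucket is a false positive and is dropped below.
    candidates = Counter()
    for _, i, j in pairs:
        if buckets[bucket_of((i, j), num_buckets)] >= threshold:
            candidates[(i, j)] += 1
    return {pair: n for pair, n in candidates.items() if n >= threshold}
```

A high-support pair always lands in a heavy bucket, so the hashing scheme never misses one; the exact count in the second scan is what removes false positives. When most pairs have low support, few buckets are heavy and the second scan keeps few counters, which fits the iceberg-query setting mentioned above.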
Extra: Data Mining in the InfoLab
Recommendations in CourseRank

  user   courses by quarter, q1 to q4 (course: rating)
  u1     a: 5, b: 5, d: 5
  u2     a: 1, e: 2, d: 4, f: 3
  u3     g: 4, h: 2, e: 3, f: 3
  u4     b: 2, g: 4, h: 4, e: 4
  u      a: 5, g: 4, e: 4

- u3 and u4 are similar to u
  - Recommend h
- Recommend d (and f, h)
Sequence Mining
- Given a set of transcripts, use Pr[x|a] to predict if x is a good recommendation given user has taken a.
- Two issues...
Pr[x|a] Not Quite Right
- Target user's transcript: [ ... a .... || unknown ]
- Recommend x?

  transcript   containing
  1            -
  2            a
  3            x
  4            a -> x
  5            x -> a

- Pr[x|a] = 2/3
- Pr[x|a~x] = 1/2
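Both probabilities can be checked directly on the five transcripts. The sketch below reads the condition a~x as "a was taken and x was not taken before a", which is an interpretation of the slide's notation; it is chosen because it mirrors the target user, who has taken a but not yet x. The function names are illustrative.

```python
from fractions import Fraction

# The five transcripts from the slide as ordered course sequences;
# transcript 1 contains neither a nor x.
TRANSCRIPTS = {
    1: [],
    2: ["a"],
    3: ["x"],
    4: ["a", "x"],
    5: ["x", "a"],
}


def pr_x_given_a(transcripts, x="x", a="a"):
    """Fraction of transcripts containing a that also contain x, any order."""
    with_a = [t for t in transcripts.values() if a in t]
    return Fraction(sum(1 for t in with_a if x in t), len(with_a))


def pr_x_given_a_without_prior_x(transcripts, x="x", a="a"):
    """Among transcripts with a and no x before it, fraction where x follows."""
    eligible = [
        t for t in transcripts.values()
        if a in t and x not in t[: t.index(a)]
    ]
    followed = sum(1 for t in eligible if x in t[t.index(a) + 1:])
    return Fraction(followed, len(eligible))
```

Transcript 5 (x before a) counts toward Pr[x|a] but says nothing about a user who has taken a and not yet x, so excluding it lowers the estimate from 2/3 to 1/2.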
User Has Taken >= 1 Course
- User has taken T = {a, b, c}
- Need Pr[x|T~x]
- Approximate as Pr[x|a~x b~x c~x]
- Expensive to compute, so...
CourseRank User Study
[Figure: chart of percentage of ratings, with categories "good, expected" and "good, unexpected"]
Notes on entity resolution:
- applies to persons, but also other kinds of entities
- two-fold: (1) identify records that represent the same entity; (2) combine these entities