Transcript
  • Data Warehousing Overview
    CS 245, Notes 11
    Hector Garcia-Molina, Stanford University

  • Outline
    - What is a data warehouse?
    - Why a warehouse?
    - Models & operations
    - Implementing a warehouse

  • What is a Warehouse?
    Collection of diverse data:
    - subject oriented
    - aimed at executive, decision maker
    - often a copy of operational data
    - with value-added data (e.g., summaries, history)
    - integrated
    - time-varying
    - non-volatile

  • What is a Warehouse?
    Collection of tools:
    - gathering data
    - cleansing, integrating, ...
    - querying, reporting, analysis
    - data mining
    - monitoring, administering warehouse

  • Warehouse Architecture
    [Figure: warehouse architecture diagram; surviving label: Metadata]

  • Motivating Examples
    - Forecasting
    - Comparing performance of units
    - Monitoring, detecting fraud
    - Visualization

  • Alternative to Warehousing
    Two approaches:
    - Query-Driven (Lazy)
    - Warehouse (Eager)

  • Query-Driven Approach

  • Advantages of Warehousing
    - High query performance
    - Queries not visible outside warehouse
    - Local processing at sources unaffected
    - Can operate when sources unavailable
    - Can query data not stored in a DBMS
    - Extra information at warehouse
        - Modify, summarize (store aggregates)
        - Add historical information

  • Advantages of Query-Driven
    - No need to copy data
        - less storage
        - no need to purchase data
    - More up-to-date data
    - Query needs can be unknown
    - Only query interface needed at sources
    - May be less draining on sources
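The contrast between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: `SOURCE_A`, `SOURCE_B`, `query_driven_total`, and the `Warehouse` class are hypothetical names, and the sources are plain in-memory lists rather than real operational systems.

```python
# Hypothetical sources: two operational systems holding sales rows.
SOURCE_A = [{"store": "c1", "amt": 12}, {"store": "c1", "amt": 11}]
SOURCE_B = [{"store": "c3", "amt": 50}, {"store": "c2", "amt": 8}]


def query_driven_total(sources):
    """Lazy: contact every source at query time and combine the answers."""
    return sum(row["amt"] for source in sources for row in source)


class Warehouse:
    """Eager: copy and summarize source data ahead of time."""

    def __init__(self):
        self.total = 0

    def load(self, sources):
        # Runs at refresh time, not at query time.
        self.total = sum(row["amt"] for source in sources for row in source)

    def query_total(self):
        # Answered locally; the sources are not contacted.
        return self.total


warehouse = Warehouse()
warehouse.load([SOURCE_A, SOURCE_B])

# A new sale arrives at a source after the warehouse was loaded.
SOURCE_B.append({"store": "c2", "amt": 4})

lazy_answer = query_driven_total([SOURCE_A, SOURCE_B])  # sees the new row
eager_answer = warehouse.query_total()                  # stale until the next load
```

The lazy answer reflects the newest source data, while the eager answer is served from the local copy without touching the sources, which mirrors the trade-off listed on the two "Advantages" slides.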

  • Warehouse Models & Operators
    - Data Models
        - relational
        - cubes
    - Operators

  • Star
    [Figure: sample data laid out as a star, with the sale table in the center]

    sale
      orderId  date    custId  prodId  storeId  qty  amt
      o100     1/7/97  53      p1      c1       1    12
      o102     2/7/97  53      p2      c1       2    11
      105      3/8/97  111     p1      c3       5    50

    customer
      custId  name   address    city
      53      joe    10 main    sfo
      81      fred   12 main    sfo
      111     sally  80 willow  la

    product
      prodId  name  price
      p1      bolt  10
      p2      nut   5

    store
      storeId  city
      c1       nyc
      c2       sfo
      c3       la

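  • Star Schema
    [Figure: the sale table linked to the customer, product, and store tables]
    sale(orderId, date, custId, prodId, storeId, qty, amt)
    customer(custId, name, address, city)
    product(prodId, name, price)
    store(storeId, city)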

  • Terms
    - Fact table
    - Dimension tables
    - Measures
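To make the three terms concrete, here is a small sketch that builds the slides' sample data in SQLite and runs one star join. Table names, column names, and rows follow the sample tables in these notes; the column types, the use of SQLite, and the particular query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per entity.
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: a key into each dimension plus the measures qty and amt.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt REAL
);
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# A star join: the fact table joined to two of its dimensions, then aggregated.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
```

Here `sale` plays the role of the fact table, `customer`, `product`, and `store` are dimension tables, and `qty` and `amt` are the measures.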

  • Dimension Hierarchies
    [Figure: hierarchy on the store dimension: store -> sType, and store -> city -> region]

    store
      storeId  cityId  tId  mgr
      s5       sfo     t1   joe
      s7       sfo     t2   fred
      s9       la      t1   nancy

    sType
      tId  size   location
      t1   small  downtown
      t2   large  suburbs

    city
      cityId  pop  regId
      sfo     1M   north
      la      5M   south

    region
      regId  name
      north  cold region
      south  warm region

    - snowflake schema
    - constellations
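A dimension hierarchy is navigated by joining up the chain of tables. The sketch below loads the store, city, region, and sType rows from the slide into SQLite and counts stores per region; the column types and the query itself are illustrative assumptions, not part of the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId),
                     mgr TEXT);
INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType  VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store  VALUES ('s5', 'sfo', 't1', 'joe'),
                          ('s7', 'sfo', 't2', 'fred'),
                          ('s9', 'la',  't1', 'nancy');
""")

# Walk the hierarchy store -> city -> region and aggregate at the region level.
stores_per_region = conn.execute("""
    SELECT r.name, COUNT(*)
    FROM store s
    JOIN city c   ON c.cityId = s.cityId
    JOIN region r ON r.regId = c.regId
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
```

Because each level lives in its own normalized table, this layout corresponds to the snowflake variant mentioned on the slide.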

  • Cube
    Fact table view:

      sale
        prodId  storeId  amt
        p1      c1       12
        p2      c1       11
        p1      c3       50
        p2      c2       8

    Multi-dimensional cube (dimensions = 2):

              c1   c2   c3
        p1    12        50
        p2    11   8
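The two views hold the same information. As a rough sketch, the fact table can be pivoted into the 2-D cube as follows; the variable names `fact`, `cube`, and `grid` are illustrative, and `None` stands for an empty cell.

```python
from collections import defaultdict

# Fact table view: (prodId, storeId, amt) rows from the slide.
fact = [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)]

# Cube view: one axis per dimension, the measure stored in the cell.
cube = defaultdict(dict)
for prod, store, amt in fact:
    cube[prod][store] = amt

# Lay the cube out as a grid with one column per store.
stores = sorted({store for _prod, store, _amt in fact})
grid = {prod: [cells.get(store) for store in stores] for prod, cells in cube.items()}
```

With this data, `grid["p1"]` has values for c1 and c3 and an empty cell for c2, matching the cube drawn on the slide.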

  • 3-D Cube
    Fact table view:

      sale
        prodId  storeId  date  amt
        p1      c1       1     12
        p2      c1       1     11
        p1      c3       1     50
        p2      c2       1     8
        p1      c1       2     44
        p1      c2       2     4

    Multi-dimensional cube (dimensions = 3):

        day 1     c1   c2   c3
          p1      12        50
          p2      11   8

        day 2     c1   c2   c3
          p1      44   4
          p2
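Adding the date dimension turns each cell address into a triple. A minimal sketch, assuming a sparse dictionary keyed by (prodId, storeId, date); the helper `slice_by_day` is a hypothetical name used only here.

```python
# Fact table view: (prodId, storeId, date, amt) rows from the slide.
fact = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Sparse 3-D cube: only non-empty cells are stored.
cube = {(prod, store, day): amt for prod, store, day, amt in fact}


def slice_by_day(cube, day):
    """Fix the date coordinate to obtain a 2-D (prodId, storeId) slice."""
    return {(p, s): amt for (p, s, d), amt in cube.items() if d == day}


day1 = slice_by_day(cube, 1)
day2 = slice_by_day(cube, 2)
```

Each slice corresponds to one of the "day 1" and "day 2" layers of the cube.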

  • Operators
    - Traditional
        - selection
        - aggregation
        - ...
    - Analysis
        - clean data
        - find trends
        - ...
    [Figure labels: Relational, Cube]

  • Aggregates
    Add up amounts for day 1.
    In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
    Result: 81
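The day 1 query can be checked against the 3-D fact table from these notes. This sketch runs it in SQLite through Python; the choice of SQLite and the column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Selection (WHERE) followed by aggregation (sum) over the surviving rows.
(day1_total,) = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()
```

The four day 1 rows add up to 12 + 11 + 50 + 8.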

  • Aggregates
    Add up amounts by day.
    In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
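To show what GROUP BY computes, the same grouping can be written out by hand. This is a plain-Python sketch over the 3-D fact table rows from these notes, not a description of how a DBMS implements grouping; `totals_by_day` and `result` are illustrative names.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the fact table.
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# One bucket per distinct date; each row's amt is added to its bucket.
totals_by_day = defaultdict(int)
for _prod, _store, day, amt in sale:
    totals_by_day[day] += amt

result = sorted(totals_by_day.items())
```

Each output row pairs a date with the sum of `amt` over all rows that share that date.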

  • Another Example
    Add up amounts by day, product.
    In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
    [Figure labels: rollup (to the coarser by-day totals), drill-down (back to the finer by-day, by-product totals)]
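Rollup and drill-down relate the finer and coarser groupings. The sketch below, which assumes SQLite and illustrative variable names, computes the by-day, by-product sums and then rolls them up to by-day sums by adding across products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Finer level (drill-down target): one row per (date, prodId) group.
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rollup: sums are additive, so the coarser by-day totals can be
# derived from the finer result without rescanning the fact table.
by_day = {}
for date, _prod, total in by_day_product:
    by_day[date] = by_day.get(date, 0) + total

# Cross-check against grouping the fact table directly by date.
direct = dict(conn.execute("SELECT date, sum(amt) FROM sale GROUP BY date").fetchall())
```

This shortcut works for sum; an operator such as median cannot in general be rolled up from partial results in the same way.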

  • Aggregates
    - Operators: sum, count, max, min, median, ave
    - "Having" clause
    - Using dimension hierarchy
        - average by region (within store)
        - maximum by month (within date)
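As an illustration of the "maximum by month (within date)" item combined with a HAVING clause, the sketch below groups by a month derived from the date. The rows, the ISO date format, and the column name `saleDate` are hypothetical, chosen so that SQLite's `strftime` can extract the month; they are not the sample data from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (saleDate TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [
        ("1997-01-07", 12), ("1997-01-20", 44),  # hypothetical rows
        ("1997-02-07", 11),
        ("1997-03-08", 50), ("1997-03-09", 8),
    ],
)

# Group at the month level of the date hierarchy, take the maximum per
# month, and keep only months with at least two sales (HAVING filters groups).
monthly_max = conn.execute("""
    SELECT strftime('%Y-%m', saleDate) AS month, max(amt)
    FROM sale
    GROUP BY month
    HAVING count(*) >= 2
    ORDER BY month
""").fetchall()
```

WHERE filters individual rows before grouping, whereas HAVING filters whole groups after the aggregates are computed, which is why February, with a single sale, drops out.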

  • Cube Aggregation
    Example: computing sums
    [Figure: the day 1, day 2, ... slices of the 3-D cube are summed along each dimension]

    Sum over products and dates, by store:
              c1   c2   c3
        sum   67   12   50

    Sum over stores and dates, by product:
        p1    110
        p2    19

    Grand total: 129

    &A

    Page &P

    Sheet5

    &A

    Page &P

    Sheet6

    &A

    Page &P

    Sheet7

    &A

    Page &P

    Sheet8

    &A

    Page &P

    Sheet9

    &A

    Page &P

    Sheet10

    &A

    Page &P

    Sheet11

    &A

    Page &P

    Sheet12

    &A

    Page &P

    Sheet13

    &A

    Page &P

    Sheet14

    &A

    Page &P

    Sheet15

    &A

    Page &P

    Sheet16

    &A

    Page &P

  • Cube Operators

    [Figure: the sales cube with one slice per day (day 1, day 2, ...), annotated with roll-ups]

    sale(c1,*,*) = 67
    sale(*,*,*) = 129
    sale(c2,p2,*) = 8

    sale    prodId  storeId  date  amt
            p1      c1       1     12
            p2      c1       1     11
            p1      c3       1     50
            p2      c2       1     8
            p1      c1       2     44
            p1      c2       2     4
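
    The roll-ups named on this slide can be checked against the six sale rows (prodId, storeId, date, amt) that survive in the slide's embedded sheet. What follows is a minimal sketch, not the course's implementation: the function name `sale`, the `ALL` wildcard constant, and the argument order (store, product, date, read off sale(c2,p2,*)) are assumptions made for illustration.

```python
# Fact rows recovered from the slide's sale table: (prodId, storeId, date, amt).
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

ALL = "*"  # wildcard: aggregate over the whole dimension


def sale(store=ALL, product=ALL, date=ALL):
    """Sum amt over the rows that match every non-wildcard coordinate."""
    total = 0
    for prod_id, store_id, day, amt in SALES:
        if store != ALL and store != store_id:
            continue
        if product != ALL and product != prod_id:
            continue
        if date != ALL and date != day:
            continue
        total += amt
    return total
```

    With these rows, sale("c1") gives 67, sale() gives 129 (the total shown on the slide), and sale("c2", "p2") gives 8.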

  • Extended Cube

    Planes: day 2, day 1, * (all days). Rows are prodId, columns are storeId; * marks the total over a dimension.

    * (all days)   c1   c2   c3    *
    p1             56    4   50  110
    p2             11    8        19
    *              67   12   50  129

    day 2          c1   c2   c3    *
    p1             44    4        48
    p2
    *              44    4        48

    day 1          c1   c2   c3    *
    p1             12        50   62
    p2             11    8        19
    *              23    8   50   81

    Example cell: sale(*,p2,*) = 19
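
    The "*" cells of the extended cube can be derived mechanically from the fact rows: each row contributes its amt to every cell obtained by replacing any subset of its coordinates with "*". The sketch below assumes the six sale rows recoverable from the slide's embedded sheet; the name `extended_cube` and the (prodId, storeId, date) key order are illustrative choices, not something the slides define.

```python
from collections import defaultdict
from itertools import product as cartesian

# Fact rows recovered from the slide's sale table: (prodId, storeId, date, amt).
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

ALL = "*"


def extended_cube(rows):
    """Return {(prodId, storeId, date): total}, including every '*' cell."""
    cube = defaultdict(int)
    for prod_id, store_id, day, amt in rows:
        # Each row feeds 2**3 = 8 cells: every coordinate either kept or '*'.
        for cell in cartesian((prod_id, ALL), (store_id, ALL), (day, ALL)):
            cube[cell] += amt
    return dict(cube)


cube = extended_cube(SALES)
```

    Here cube[("p2", "*", "*")] is 19, matching sale(*,p2,*) on the slide, and cube[("*", "*", "*")] is 129. Cells with no underlying sales, such as ("p2", "c3", "*"), are simply absent.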

  • Aggregation Using Hierarchies

    Hierarchy: customer → region → country
    (customer c1 in Region A; customers c2, c3 in Region B)

    [Figure: the cube's customer dimension rolled up into columns region A and region B, with rows p1 and p2]
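
    Rolling up along a hierarchy amounts to replacing each member by its parent and re-aggregating. The sketch below uses the region assignment stated on the slide (c1 in Region A; c2, c3 in Region B) and assumes the ids c1..c3 in the recovered sale rows are the ones that assignment refers to. The names `REGION_OF` and `roll_up` are hypothetical.

```python
from collections import Counter

# Fact rows recovered from the slide's sale table: (prodId, memberId, date, amt).
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# One level of the hierarchy, as stated on the slide.
REGION_OF = {"c1": "region A", "c2": "region B", "c3": "region B"}


def roll_up(rows, parent_of):
    """Aggregate amt by (product, parent of the member), over all dates."""
    totals = Counter()
    for prod_id, member_id, _day, amt in rows:
        totals[(prod_id, parent_of[member_id])] += amt
    return dict(totals)


by_region = roll_up(SALES, REGION_OF)
```

    The same function could be applied again with a region-to-country mapping to reach the next level, since sums compose along the hierarchy.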

  • Data Analysis

    Decision Trees
    Clustering
    Association Rules

  • Decision Trees

    Example:
    Conducted a survey to see which customers were interested in a new model car
    Want to select customers for an advertising campaign

    Training set:

    custId  car     age  city  newCar
    c1      taurus  27   sf    yes
    c2      van     35   la    yes
    c3      van     40   sf    yes
    c4      taurus  22   sf    yes
    c5      merc    50   la    no
    c6      taurus  25   la    no

  • One Possibility

    [Decision tree figure; surviving node label: age]

  • Another Possibility

    [Decision tree figure; surviving node labels: car=taurus, city=sf, age]

  • Issues

    Decision tree cannot be too deep
      would not have statistically significant amounts of data for lower decisions
    Need to select tree that most reliably predicts outcomes
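
    To make "select the tree that most reliably predicts outcomes" concrete, the sketch below scores two candidate trees against the six survey rows recoverable from the training-set sheet. Both trees are hypothetical: the slides' tree figures did not survive extraction, so the split order and thresholds (age < 30, age < 45) are assumptions chosen only to be consistent with the surviving node labels.

```python
# Training set recovered from the slide: (custId, car, age, city, newCar).
TRAINING_SET = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van", 35, "la", "yes"),
    ("c3", "van", 40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc", 50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]


def tree_by_age(car, age, city):
    """Hypothetical tree: split on age first, then on city or car."""
    if age < 30:
        return "yes" if city == "sf" else "no"
    return "yes" if car == "van" else "no"


def tree_by_car(car, age, city):
    """Hypothetical tree: split on car first, then city, then age."""
    if car == "taurus":
        if city == "sf":
            return "yes" if age < 45 else "no"
        return "no"
    return "yes" if age < 45 else "no"


def accuracy(tree, rows):
    """Fraction of rows whose newCar label the tree predicts correctly."""
    hits = sum(
        1 for _cust, car, age, city, label in rows
        if tree(car, age, city) == label
    )
    return hits / len(rows)
```

    Both trees classify all six training rows correctly, which illustrates the issue raised above: with so little data, training accuracy alone cannot separate candidate trees, and deeper branches rest on only one or two rows each.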

  • Clustering

    [Figure: clusters plotted against the axes age, income, education]

  • Another Example: Text

    Each document is a vector
      e.g., contains words 1,4,5,...
    Clusters contain similar documents
    Useful for understanding, searching documents

    [Figure: document clusters; surviving labels: international, news, sports, business]
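
    One way to read "each document is a vector" is as a binary vector indexed by word id, stored as the set of word ids the document contains. The sketch below measures similarity between such vectors with cosine similarity, which is an assumption: the slides do not say which similarity measure is intended. The documents d1..d3 are made up for illustration.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine similarity of two binary word vectors given as sets of word ids."""
    if not doc_a or not doc_b:
        return 0.0
    # For 0/1 vectors the dot product is the overlap size and each norm
    # is the square root of the set size.
    return len(doc_a & doc_b) / math.sqrt(len(doc_a) * len(doc_b))


# Hypothetical documents; d1 follows the slide's "contains words 1,4,5".
DOCS = {
    "d1": {1, 4, 5},
    "d2": {1, 4, 6},
    "d3": {2, 3, 7},
}
```

    Here d1 and d2 share two of three words (similarity 2/3) and would tend to land in the same cluster, while d1 and d3 share none (similarity 0).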

  • Issues

    Given desired number of clusters?
    Finding best clusters
    Are clusters semantically meaningful?
      e.g., yuppies cluster?
    Using clusters for disk storage
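
    When the number of clusters is given, one common way to search for good clusters is k-means (Lloyd's algorithm). The slides do not name an algorithm, so this is a swapped-in technique shown only as a sketch; the (age, income) points and the starting centroids are hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical (age, income in thousands) points forming two obvious groups.
POINTS = [(25, 30), (27, 35), (29, 32), (55, 90), (60, 95), (58, 85)]
centroids, clusters = k_means(POINTS, [POINTS[0], POINTS[3]])
```

    The result depends on the starting centroids and on k itself, which is exactly the first issue listed above; whether the resulting groups mean anything is a separate question the algorithm cannot answer.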

  • Association Rule Mining

    Sales records (market-basket data):

    transaction id  customer id  products bought
    tran1           cust33       p2, p5, p8
    tran2           cust45       p5, p8, p11
    tran3           cust12       p1, p9
    tran4           cust40       p5, p8, p11
    tran5           cust12       p2, p9
    tran6           cust12       p9

    Trend: Products p5, p8 often bought together
    Trend: Customer 12 likes product p9

  • Association Rule

    Rule: {p1, p3, p8}
    Support: number of baskets where these products appear
    High-support set: support ≥ threshold s
    Problem: find all high support sets
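
    Support counting can be sketched directly on the six market-basket transactions from the Association Rule Mining slide. The code below enumerates every subset of every basket, which is only workable for tiny examples and is not the efficient mining the outline alludes to. It also assumes a set counts as high-support when its support is at least s; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Baskets tran1..tran6 from the market-basket table.
BASKETS = [
    {"p2", "p5", "p8"},
    {"p5", "p8", "p11"},
    {"p1", "p9"},
    {"p5", "p8", "p11"},
    {"p2", "p9"},
    {"p9"},
]


def support(itemset, baskets):
    """Number of baskets that contain every product in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def high_support_sets(baskets, s):
    """Brute force: count each non-empty subset of each basket, keep those >= s."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= s}
```

    With s = 3 the high-support sets are {p5}, {p8}, {p9} and {p5, p8}, each with support 3, which agrees with the two trends noted on the previous slide.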

  • Implementation Issues

    ETL (Extraction, transformation, loading)
    Getting data to the warehouse
    Entity Resolution
    What to materialize?
    Efficient Analysis
    Association rule mining
    ...

  • ETL: Monitoring Techniques

    Periodic snapshots
    Database triggers
    Log shipping
    Data shipping (replication service)
    Transaction shipping
    Polling (queries to source)
    Screen scraping
    Application level monitoring

    Advantages & Disadvantages!!
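
    Of the techniques listed, periodic snapshots are the easiest to sketch: the monitor keeps the previous dump of a source table and compares it with the current one to derive the changes to ship to the warehouse. The function name `snapshot_diff`, the keyed-dictionary representation, and the sample rows are all assumptions for illustration.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates), where updates maps a key to
    its (old_row, new_row) pair.
    """
    inserts = {key: new[key] for key in new.keys() - old.keys()}
    deletes = {key: old[key] for key in old.keys() - new.keys()}
    updates = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return inserts, deletes, updates


# Hypothetical snapshots of a product table: id -> (name, price).
yesterday = {"p1": ("bolt", 10), "p2": ("nut", 5)}
today = {"p1": ("bolt", 12), "p3": ("washer", 2)}
inserts, deletes, updates = snapshot_diff(yesterday, today)
```

    The trade-off is visible even here: the source needs no triggers or log access, but every snapshot must be read in full and changes that occur and revert between two snapshots are never seen.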

  • ETL: Data Cleaning

    Migration (e.g., yen → dollars)
    Scrubbing: use domain-specific knowledge (e.g., social security numbers)
    Fusion (e.g., mail list, customer merging)
    Auditing: discover rules & relationships (like data mining)
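
    Scrubbing with domain-specific knowledge can be as simple as knowing what a valid value looks like. The sketch below normalizes social security numbers to the ddd-dd-dddd layout and rejects anything that does not contain exactly nine digits. The function name `scrub_ssn` is hypothetical, and a real scrubber would apply further validity rules that are not covered here.

```python
import re


def scrub_ssn(raw):
    """Return the number as ddd-dd-dddd, or None if it cannot be one."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, stray characters
    if len(digits) != 9:
        return None
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
```

    For example, "123 45 6789" and "123456789" both become "123-45-6789", while "12-345" is rejected.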

  • More details: Entity Resolution

    e1: N: a, A: b, CC#: c, Ph: e
    e2: N: a, Exp: d, Ph: e

  • Applications

    comparison shopping
    mailing lists
    classified ads
    customer files
    counter-terrorism

    e1: N: a, A: b, CC#: c, Ph: e
    e2: N: a, Exp: d, Ph: e

  • Why is ER Challenging?

    Huge data sets
    No unique identifiers
    Lots of uncertainty
    Many ways to skin the cat

  • Taxonomy: Pairwise vs Global

    Decide if r, s match only by looking at r, s?
    Or need to consider more (all) records?
    Global matching complicates things a lot!
      e.g., change decision as new records arrive

    Figure records (the first is a candidate match for the second or the third):
      Nm: Pat Smith, Ad: 123 Main St, Ph: (650) 555-1212
      Nm: Patrick Smith, Ad: 132 Main St, Ph: (650) 555-1212
      Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 777-1111
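
    The three Smith records show why a purely pairwise decision can be unsatisfying. The sketch below applies a hypothetical pairwise rule (same surname plus a shared phone or address) to the records from the slide; the rule is an assumption chosen for illustration, not a method the slides prescribe.

```python
PAT = {"Nm": "Pat Smith", "Ad": "123 Main St", "Ph": "(650) 555-1212"}
PATRICK = {"Nm": "Patrick Smith", "Ad": "132 Main St", "Ph": "(650) 555-1212"}
PATRICIA = {"Nm": "Patricia Smith", "Ad": "123 Main St", "Ph": "(650) 777-1111"}


def pairwise_match(r, s):
    """Hypothetical rule: same surname and a shared phone or address.

    Only r and s are inspected; no other record influences the decision.
    """
    same_surname = r["Nm"].split()[-1] == s["Nm"].split()[-1]
    return same_surname and (r["Ph"] == s["Ph"] or r["Ad"] == s["Ad"])
```

    Under this rule Pat matches Patrick (shared phone) and Pat matches Patricia (shared address), yet Patrick and Patricia do not match each other. Deciding which of the two Pat really belongs with requires looking beyond the pair, which is the global case.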

  • Taxonomy: Outcome

    Partition of records
      e.g., comparison shopping
    Merged records

    Figure records:
      Nm: Pat Smith, Ad: 123 Main St, Ph: (650) 555-1212
      Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 555-1212, (650) 777-1111, Hair: Black
      Nm: Patricia Smith, Ad: 132 Main St, Ph: (650) 777-1111, Hair: Black

  • Taxonomy: Outcome

    Iterate after merging

    Figure records:
      Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K
      Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM
      Nm: Thomas, Ad: 123 Maim, Oc: lawyer
      Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer
      Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K

  • Taxonomy: Record Reuse

    One record related to multiple entities?

    Figure records:
      Nm: Pat Smith Sr., Ph: (650) 555-1212
      Ph: (650) 555-1212, Ad: 123 Main St
      Nm: Pat Smith Jr., Ph: (650) 555-1212
      Nm: Pat Smith Sr., Ph: (650) 555-1212, Ad: 123 Main St
      Nm: Pat Smith Jr., Ph: (650) 555-1212, Ad: 123 Main St

  • Taxonomy: Record Reuse

    Partitions
    Merges
    [Figure: records r, s, t with groupings rst, rs, st]
    Record reuse complex and expensive!

  • Taxonomy: Multiple Entity Types

    [Figure; surviving labels: brother, member, business, member]

  • Taxonomy: Multiple Entity Types

    [Figure; surviving labels: authors, papers, same??]

  • Taxonomy: Exact vs Approximate

    [Figure: products split into cameras, CDs, books, ...; ER applied to each category yields resolved cameras, resolved CDs, resolved books, ...]

  • Taxonomy: Exact vs Approximate

    [Figure: terrorists sorted by age; B Cooper, 30, matched against ages 25-35]

  • Implementation Issues

    ETL (Extraction, transformation, loading)
    Getting data to the warehouse
    Entity Resolution
    What to materialize?
    Efficient Analysis
    Association rule mining
    ...

  • What to Materialize?

    Store in warehouse results useful for common queries
    Example:
    [Figure: the sales cube (day 1, day 2, ...) with total sales materialized; recoverable totals are p1 = 110, p2 = 19, and 129 overall]

  • Materialization Factors

    Type/frequency of queries
    Query response time
    Storage cost
    Update cost

  • Cube Aggregates Lattice

    city, product, date
    city, product    city, date    product, date
    city    product    date
    all: 129

    use greedy algorithm to decide what to materialize
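
    The lattice can be generated from the dimension list, and it determines which materialized aggregates can answer which queries: a view can answer a query if it keeps every dimension the query groups by. The sketch below shows only that structure and a lookup of the cheapest usable view. It does not implement the greedy selection mentioned on the slide, and the row counts in `MATERIALIZED` are hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("city", "product", "date")


def lattice(dimensions):
    """All group-by sets, from the full cube down to the empty set ('all')."""
    return [
        frozenset(combo)
        for size in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, size)
    ]


def can_answer(view, query):
    """A view can answer a query if it retains all of the query's dimensions."""
    return query <= view


def cheapest_source(query, materialized):
    """Smallest materialized view able to answer the query.

    `materialized` maps each stored view to its row count; the full cube
    is assumed to be stored, so at least one view is always usable.
    """
    usable = [view for view in materialized if can_answer(view, query)]
    return min(usable, key=lambda view: materialized[view])


# Hypothetical choice: the full cube plus the (product, date) aggregate.
MATERIALIZED = {
    frozenset(DIMENSIONS): 6,
    frozenset({"product", "date"}): 3,
}
```

    With this choice, a query grouped by product, or the grand total, is served from the 3-row (product, date) view, while a query grouped by city must fall back to the full cube. A selection algorithm weighs exactly this saving against storage and update cost.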

  • Dimension Hierarchies

    Hierarchy: city -> state -> all

    cities(city, state):
    city  state
    c1    CA
    c2    NY

  • Dimension Hierarchies

    Lattice including the state level (not all arcs shown...):
    (city, product, date), (state, product, date)
    (city, product), (city, date), (product, date), (state, product), (state, date)
    (city), (product), (date), (state)
    (all)
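A dimension hierarchy lets an aggregate at the city level be rolled up to the state level without going back to the raw sales. The sketch below illustrates that idea under stated assumptions: the mappings c1 -> CA and c2 -> NY follow the cities table, while the entry for c3, the sale amounts, and the function name `roll_up` are hypothetical.

```python
from collections import defaultdict

# city -> state; c1 and c2 follow the cities table, c3 is an assumption.
CITY_TO_STATE = {"c1": "CA", "c2": "NY", "c3": "CA"}

# (product, city, amt) rows, already aggregated at the (city, product) level.
# The amounts are illustrative.
CITY_PRODUCT = [
    ("p1", "c1", 12),
    ("p2", "c1", 11),
    ("p1", "c3", 50),
    ("p2", "c2", 8),
    ("p1", "c2", 4),
]


def roll_up(city_product, city_to_state):
    """Roll (city, product) totals up to (state, product) totals."""
    totals = defaultdict(int)
    for product, city, amt in city_product:
        totals[(city_to_state[city], product)] += amt
    return dict(totals)


state_product = roll_up(CITY_PRODUCT, CITY_TO_STATE)
print(state_product)
```

This is why the lattice gains nodes such as (state, product): each one can be computed from its city-level counterpart.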

  • Interesting Hierarchy

    Levels of the time dimension: all, years, quarters, months, weeks, days

    Conceptual dimension table, time(day, week, month, quarter, year):
    day  week  month  quarter  year
    1    1     1      1        2000
    2    1     1      1        2000
    3    1     1      1        2000
    4    1     1      1        2000
    5    1     1      1        2000
    6    1     1      1        2000
    7    1     1      1        2000
    8    2     1      1        2000

  • Implementation Issues

    ETL (extraction, transformation, loading): getting data to the warehouse; entity resolution
    What to materialize?
    Efficient analysis: association rule mining; ...

  • Finding High-Support Pairs

    Baskets(basket, item)

    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

    WHY?

  • Example

    Baskets                 Pairs produced by the self-join
    basket  item            basket  item1  item2
    t1      p2              t1      p2     p5
    t1      p5              t1      p2     p8
    t1      p8              t1      p5     p8
    t2      p5              t2      p5     p8
    t2      p8              t2      p5     p11
    t2      p11             t2      p8     p11
    ...     ...             ...     ...    ...
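To see the high-support-pairs query run on the example rows, here is a small sketch using Python's built-in `sqlite3` module. The helper name `high_support_pairs`, the TEXT column types, and the choice s = 2 are assumptions for illustration. The slides' threshold `s` is passed as a query parameter.

```python
import sqlite3

PAIR_QUERY = """
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket
  AND I.item < J.item      -- each unordered pair once, and no item paired with itself
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= ?;
"""

# The six example rows of Baskets(basket, item).
ROWS = [
    ("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
    ("t2", "p5"), ("t2", "p8"), ("t2", "p11"),
]


def high_support_pairs(rows, s):
    """Return (item1, item2, count) for pairs found together in >= s baskets."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
    con.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)
    result = con.execute(PAIR_QUERY, (s,)).fetchall()
    con.close()
    return result


print(high_support_pairs(ROWS, 2))
```

Items are compared as text here, so 'p11' sorts before 'p5'. Any consistent order works, since the condition only needs to pick one orientation of each pair.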

  • Issues

    Performance for size 2 rules: the Baskets table is big, and the table of pairs produced by the self-join is even bigger!
    Performance for size k rules

  • Association Rules

    How do we perform rule mining efficiently?
    Observation: if a set X has support t, then every subset of X has support at least t.
    For 2-sets: if we need support s for {i, j}, then i and j must each appear in at least s baskets.

  • Algorithm for 2-Sets

    (1) Find OK products: those appearing in s or more baskets
    (2) Find high-support pairs using only OK products

  • Algorithm for 2-Sets

    Keep only the rows of Baskets whose item appears in s or more baskets:

    INSERT INTO okBaskets(basket, item)
    SELECT basket, item
    FROM Baskets
    WHERE item IN (SELECT item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s);

    Perform mining on okBaskets:

    SELECT I.item, J.item, COUNT(I.basket)
    FROM okBaskets I, okBaskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;
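The two-step algorithm for 2-sets can be tried end to end with `sqlite3`. This is a sketch under assumptions: the helper name `apriori_pairs` is hypothetical, (basket, item) rows are assumed distinct, and the first step filters with an `IN` subquery so that every row of a frequent item is kept. Grouping by item alone would collapse each item to a single row.

```python
import sqlite3

ROWS = [
    ("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
    ("t2", "p5"), ("t2", "p8"), ("t2", "p11"),
]


def apriori_pairs(rows, s):
    """Return (rows kept in okBaskets, high-support pairs) for threshold s."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
    con.execute("CREATE TABLE okBaskets (basket TEXT, item TEXT)")
    con.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)

    # Step 1: keep only rows whose item appears in s or more baskets.
    con.execute(
        """
        INSERT INTO okBaskets(basket, item)
        SELECT basket, item
        FROM Baskets
        WHERE item IN (SELECT item FROM Baskets
                       GROUP BY item HAVING COUNT(basket) >= ?)
        """,
        (s,),
    )
    kept = con.execute("SELECT COUNT(*) FROM okBaskets").fetchone()[0]

    # Step 2: the same pair query as before, on the smaller table.
    pairs = con.execute(
        """
        SELECT I.item, J.item, COUNT(I.basket)
        FROM okBaskets I, okBaskets J
        WHERE I.basket = J.basket AND I.item < J.item
        GROUP BY I.item, J.item
        HAVING COUNT(I.basket) >= ?
        """,
        (s,),
    ).fetchall()
    con.close()
    return kept, pairs


kept, pairs = apriori_pairs(ROWS, 2)
print(kept, pairs)
```

On the example rows with s = 2, items p2 and p11 are dropped before the self-join, so the join runs over four rows instead of six and still finds the same pair.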

  • Counting Efficiently

    One way (threshold = 3):

    basket  I.item  J.item
    t1      p5      p8
    t2      p5      p8
    t2      p8      p11
    t3      p2      p3
    t3      p5      p8
    t3      p2      p8
    ...     ...     ...

  • Counting Efficiently

    Another way (threshold = 3):

    [Figure: worked example over the (basket, I.item, J.item) pair table]

  • Yet Another Way

    threshold = 3

    [Figure: worked example over the (basket, I.item, J.item) pair table; one entry is marked as a false positive]

  • Discussion

    Hashing scheme: 2 (or 3) scans of data
    Sorting scheme: requires a sort!
    Hashing works well if there are few high-support pairs and many low-support ones
    Iceberg queries
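The figures for the hashing scheme did not survive in this transcript, so the following is only a sketch of one common two-scan reading of it. Scan 1 counts pairs per hash bucket. Scan 2 counts exactly only the pairs whose bucket reached the threshold. A pair that lands in a heavy bucket without being frequent itself is a false positive, which the second scan weeds out. The function names, the bucket count, and the use of `zlib.crc32` as the hash are assumptions.

```python
import zlib
from collections import Counter

# (basket, I.item, J.item) rows from the Counting Efficiently table.
PAIR_ROWS = [
    ("t1", "p5", "p8"),
    ("t2", "p5", "p8"),
    ("t2", "p8", "p11"),
    ("t3", "p2", "p3"),
    ("t3", "p5", "p8"),
    ("t3", "p2", "p8"),
]


def bucket_of(pair, n_buckets):
    """Deterministic hash of an item pair into one of n_buckets."""
    return zlib.crc32("|".join(pair).encode()) % n_buckets


def hashed_high_support(pair_rows, threshold, n_buckets=4):
    """Return (frequent pairs with counts, false-positive candidates)."""
    # Scan 1: one small counter per bucket, not one per distinct pair.
    buckets = [0] * n_buckets
    for _, i, j in pair_rows:
        buckets[bucket_of((i, j), n_buckets)] += 1

    # Scan 2: exact counts, but only for pairs in buckets that reached
    # the threshold; no other pair can possibly be frequent.
    exact = Counter(
        (i, j) for _, i, j in pair_rows
        if buckets[bucket_of((i, j), n_buckets)] >= threshold
    )
    frequent = {p: c for p, c in exact.items() if c >= threshold}
    false_positives = set(exact) - set(frequent)
    return frequent, false_positives


frequent, false_positives = hashed_high_support(PAIR_ROWS, threshold=3)
print(frequent, false_positives)
```

The result for `frequent` does not depend on the hash function. Only the number of false positives does, which is why the scheme pays off when few pairs have high support and most have low support.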

  • Implementation Issues

    ETL (extraction, transformation, loading): getting data to the warehouse; entity resolution
    What to materialize?
    Efficient analysis: association rule mining; ...

  • Extra: Data Mining in the InfoLab

    Recommendations in CourseRank

    user  courses and ratings, in quarter order (q1 to q4)
    u1    a: 5, b: 5, d: 5
    u2    a: 1, e: 2, d: 4, f: 3
    u3    g: 4, h: 2, e: 3, f: 3
    u4    b: 2, g: 4, h: 4, e: 4
    u     a: 5, g: 4, e: 4

    u3 and u4 are similar to u: recommend h

  • Extra: Data Mining in the InfoLab

    Recommendations in CourseRank (same ratings table)

    Recommend d (and f, h)

  • Sequence Mining

    Given a set of transcripts, use Pr[x|a] to predict whether x is a good recommendation, given that the user has taken a.
    Two issues...

  • Pr[x|a] Not Quite Right

    Target user's transcript: [ ... a .... || unknown ]; recommend x?

    transcript  containing
    1           -
    2           a
    3           x
    4           a -> x
    5           x -> a

    Pr[x|a] = 2/3
    Pr[x|a~x] = 1/2
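Both probabilities can be recomputed from the five transcripts. The sketch below assumes one reading of the notation, which the slide does not spell out: Pr[x|a~x] conditions on transcripts where a was taken and x had not been taken before a, and then asks whether x follows a. The function names are hypothetical.

```python
from fractions import Fraction

# Courses in the order taken, one list per transcript.
TRANSCRIPTS = {
    1: [],
    2: ["a"],
    3: ["x"],
    4: ["a", "x"],   # a -> x
    5: ["x", "a"],   # x -> a
}


def pr_x_given_a(transcripts, x, a):
    """Fraction of transcripts containing a that also contain x, in any order."""
    with_a = [t for t in transcripts.values() if a in t]
    if not with_a:
        return None
    return Fraction(sum(x in t for t in with_a), len(with_a))


def pr_x_given_a_not_x(transcripts, x, a):
    """Among transcripts where a was taken before any x, fraction where x follows."""
    eligible = [t for t in transcripts.values()
                if a in t and x not in t[:t.index(a)]]
    if not eligible:
        return None
    hits = sum(x in t[t.index(a) + 1:] for t in eligible)
    return Fraction(hits, len(eligible))


print(pr_x_given_a(TRANSCRIPTS, "x", "a"))        # 2/3
print(pr_x_given_a_not_x(TRANSCRIPTS, "x", "a"))  # 1/2
```

Transcript 5 is what separates the two numbers. It counts toward Pr[x|a], but it says nothing about a student who has taken a and not yet x, which is the target user's situation.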

  • User Has Taken >= 1 Course

    User has taken T = {a, b, c}
    Need Pr[x|T~x]
    Approximate as Pr[x|a~x b~x c~x]
    Expensive to compute, so...

  • CourseRank User Study

    [Chart: percentage of ratings; series: "good, expected" and "good, unexpected"]

    Speaker notes (entity resolution): applies to persons, but also to other kinds of entities. It is two-fold: (1) identify records that represent the same entity; (2) combine these entities.

