61
CompSci 516 Database Systems Lecture 23 Data Cube and Data Mining Instructor: Sudeepa Roy Duke CS, Fall 2018 CompSci 516: Database Systems 1

Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

CompSci 516DatabaseSystems

Lecture23DataCube

andDataMining

Instructor:Sudeepa Roy

DukeCS,Fall2018 CompSci516:DatabaseSystems 1

Page 2: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Announcements

• HW3dueonNov30(Fri),5pm

• FinalreportdueonDec12,Wed– Seefeedbackonpiazza

• Pleasefilloutthecourseevaluations!–Weneedtohearfromeachofyou–Weneedyourfeedback/suggestionstoimprovethisclass

DukeCS,Fall2018 CompSci516:DatabaseSystems 2

Page 3: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Announcements• Classpresentationnextlecture(Thurs,Nov29)– Intheorderofyourgroupnumber– 4minpresentation(willbetimed!)

• Presentation– 14projectsin75mins– 4minsperproject!– Noteveryonehastopresent(uptoyou)

• everyoneinagroupgetsthesamegrade– Youpresentthecurrentstatusoftheproject

• problem,example,yourapproach,whatyouplan– Besttoshowplots/screenshots/results/demo!– Trytoshowthemostinterestingobservation/findingsin4mins!

– Telluswhatyouwanttodobeforeyousubmitthefinalreport(ifanything)

DukeCS,Fall2018 CompSci516:DatabaseSystems 3

Page 4: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DataWarehousing

DukeCS,Fall2018 CompSci516:DatabaseSystems 4

Page 5: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ReadingMaterial

• [RG]– Chapter25

• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh,ICDE1996“DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals”

• Harinarayan-Rajaraman-Ullman,SIGMOD1996“Implementingdatacubesefficiently”

DukeCS,Fall2018 CompSci516:DatabaseSystems 5

Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.RamakrishnanandDr.Gehrke.• SomeslideshavebeenpreparedbyProf.Shivnath Babu

Page 6: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

6

Warehousing

• Growingindustry:$8billionwaybackin1998• DatawarehousevendorlikeTeradata

• big“Petabytescale”customers• Apple,Walmart(2008-2.5PB),eBay(2013-primaryDW9.2PB,otherbigdata40PB,singletablewith1trillionrows),Verizon,AT&T,BankofAmerica

• supportsdataintoandoutofHadoop

• Lotsofbuzzwords,hype– slice&dice,rollup,MOLAP,pivot,...

Ack:SlidebyProf.Shivnath Babu

https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 7: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Introduction• Organizationsanalyzecurrentandhistoricaldata

– toidentifyusefulpatterns– tosupportbusinessstrategies

• Emphasisisoncomplex,interactive,exploratoryanalysisofverylargedatasets

• Createdbyintegratingdatafromacrossallpartsofanenterprise

• Dataisfairlystatic

• Relevantonceagainfortherecent“BigDataanalysis”– tofigureoutwhatwecanreuse,whatwecannot

DukeCS,Fall2018 CompSci516:DatabaseSystems 7

Page 8: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

OLTP DataWarehousing/OLAPMostlyupdates Mostlyreads

Applications:Orderentry,salesupdate,bankingtransactions

Applications:Decisionsupportinindustry/organization

Detailed,up-to-datedata Summarized, historicaldata(frommultipleoperationaldb,growsovertime)

Structured,repetitive,shorttasks Queryintensive,adhoc,complexqueries

Eachtransactionreads/updatesonlyafewtuples(tensof)

Eachquerycanaccesses manyrecords,andperformmanyjoins,scans,aggregates

MB-GBdata GB-TB data

Typicallyclericalusers Decisionmakers,analysts asusers

Important:Consistency,recoverability,Maximizingtr.throughput

Important:QuerythroughputResponse times

8DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 9: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ThreeComplementaryTrends

• DataWarehousing(DW):– Consolidatedatafrommanysourcesinonelargerepository– Loading,periodicsynchronizationofreplicas– Semanticintegration

• OLAP:– ComplexSQLqueriesandviews.– Queriesbasedonspreadsheet-styleoperationsand“multidimensional”viewofdata.

– Interactiveand“online”queries.

• DataMining:– Exploratorysearchforinterestingtrendsandanomalies

DukeCS,Fall2018 CompSci516:DatabaseSystems 9

Page 10: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ROLAPandMOLAP

• RelationalOLAP(ROLAP)– OntopofstandardrelationalDBMS– DataisstoredinrelationalDBMS– SupportsextensionstoSQLtoaccessmulti-dimensionaldata

• MultidimensionalOLAP(MOLAP)– Directlystoresmultidimensionaldatainspecialdatastructures(e.g.arrays)

10DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 11: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ROLAP:StarSchema

• Toreflectmulti-dimensionalviewsofdata

• Singlefacttable

• Singletableforeverydimension

• Eachtupleinthefacttableconsistsof– pointers(foreignkey)toeach

ofthedimensions(multi-dimensionalcoordinates)

– numericvalueforthosecoordinates

• Eachdimensiontablecontainsattributesofthatdimension

11

NosupportforattributehierarchiesDukeCS,Fall2018 CompSci516:DatabaseSystems

Page 12: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DimensionHierarchies

• Foreachdimension,thesetofvaluescanbeorganizedinahierarchy:

PRODUCT TIME LOCATION

categoryweekmonthstate

pnamedatecity

year

quartercountry

DukeCS,Fall2018 CompSci516:DatabaseSystems 12

Page 13: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ROLAP:SnowflakeSchema

13

• Refinesstar-schema• Dimensionalhierarchyis

explicitlyrepresented

• (+)Dimensiontableseasiertomaintain– supposethe“category

descriptionisbeingchanged

• (-)Needadditionaljoins

• FactConstellations– Multiplefacttablessharesome

dimensionaltables– e.g.ProjectedandActual

Expensesmaysharemanydimensions

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 14: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

OLAPand

DataCube

DukeCS,Fall2018 CompSci516:DatabaseSystems 14

Page 15: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Motivation:OLAPQueries• Dataanalystsareinterestedinexploringtrendsandanomalies– Possiblybyvisualization(Excel)- 2Dor3Dplots– “DimensionalityReduction”bysummarizingdataandcomputing

aggregates

• Findtotalunitsalesforeach1. Model2. Model,brokenintoyears3. Year,brokenintocolors4. Year5. Model,brokenintocolor,….

15

Sales(Model,Year,Color,Units)

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 16: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Roll-Ups• Analysisreportsstartatacoarselevel,

gotofinerlevels• Orderofattributematters• Notrelationaldata(emptycellsno

keys)

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50

Chevy 1994 White 40

90

Chevy 1995 Black 115

Chevy 1995 White 85

200

290

GROUPBY

Sales(Model,Year,Color,Units)

Roll-ups

Drill-downs

• Roll-upsareasymmetric

Page 17: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

‘ALL’ ConstructEasiertovisualizeroll-upifallowALLtofillinthesuper-aggregates

SELECT Model, Year, Color, SUM(Units)

FROM SalesWHERE Model = ‘Chevy’

GROUP BY Model, Year, Color

UNION

SELECT Model, Year, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’GROUP BY Model, Year

UNION

UNION

SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’;

Model Year Color Units

Chevy 1994 Black 50

Chevy 1994 White 40

Chevy 1994 ‘ALL’ 90

Chevy 1995 Black 85

Chevy 1995 White 115

Chevy 1995 ‘ALL’ 200

Chevy ‘ALL’ ‘ALL’ 290

Sales(Model,Year,Color,Units)

Page 18: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DataCube:Intuition

SELECT ‘ALL’, ‘ALL’, ‘ALL’, sum(units)

FROM Sales

UNIONSELECT ‘ALL’, ‘ALL’, Color, sum(units)

FROM Sales

GROUP BY Color

UNIONSELECT ‘ALL’, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Year

UNION SELECT Model, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Model, Year

UNION ….

18

Sales(Model,Year,Color,Units)

YearColor

Mod

el

TotalUnitsales

DukeCS,Fall2018 CompSci516:DatabaseSystems

• Howmanysub-queries?• Howmanysub-queriesfor

8attributes?

MorecomplextodothesewithGROUP-BY

Page 19: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DataCube• Computestheaggregateonallpossiblecombinationsof

groupbycolumns.

• IfthereareNattributes,thereare2N-1super-aggregates.

• IfthecardinalityoftheNattributesareC1,...,CN,thenthereareatotalof(C1+1)...(CN+1)valuesinthecube.

• ROLL-UPissimilarbutjustlooksatNaggregates

Page 20: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DataCube

Ack:fromslidesbyLaurelOrrandJeremyHyrkas,UW

Page 21: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DataCube Syntax

• SQLServer

SELECT Model, Year, Color, sum(units)

FROM Sales

GROUP BY Model, Year, Color

WITH CUBE

Sales(Model,Year,Color,Units)

Page 22: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ImplementingDataCube

DukeCS,Fall2018 CompSci516:DatabaseSystems 22

Page 23: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

BasicIdeas

• Needtocomputeallgroup-by-s:– ABCD,ABC,ABD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D

• ComputeGROUP-BYsfrompreviouslycomputedGROUP-BYs– e.g.firstABCD– thenABCorACD– thenABorAC…

• WhichorderABCDissorted,mattersforsubsequentcomputations

– if(ABCD)isthesortedorder,ABCischeap,ACDorBCDisexpensive

Page 24: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Notations

• ABCD– group-byonattributesA,B,C,D– noguaranteeontheorderoftuples

• (ABCD)– sortedaccordingtoA->B->C->D

• ABCDand(ABCD)and(BCDA)– allcontainthesameresults– butindifferentsortedorder

Page 25: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Optimization1:SmallestParent

• ComputeGROUP-BYfromthesmallest(size)previouslycomputedGROUP-BYasaparent

– ABcanbecomputedfromABC,ABD,orABCD

– ABCorABDbetterthanABCD– EvenABCorABDmayhave

differentsizes,trytochoosethesmallerparent

25

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

LATTICESTRUCTUREofdatacube

Page 26: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Optimization2:CacheResults

• CacheresultofoneGROUP-BYinmemorytoreducediskI/O– ComputeABfromABCwhile

ABCisstillinmemory

26

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 27: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Optimization3:AmortizeDiskScans

• AmortizediskreadsformultipleGROUP-BYs– SupposetheresultforABCD

isstoredondisk– ComputeallofABC,ABD,

ACD,BCDsimultaneouslyinonescanofABCD

27

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 28: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Optimization4:Sharedpartition

• forhash-basedalgorithms– Useshashtablesto

computesmallerGROUP-Bys

– IfthehashtablesforABandACfitinmemory,computebothinonescanofABC

– OtherwisepartitiononA,andcomputeHTsofABandACindifferentpartitions

• “pipe-hash”algorithmnotcoveredinclass

28

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 29: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

(ABCD)

(ABC) ABD ACD BCD

AC AD BC BD CD(AB)

(A) B C D

all Level0

Level1

Level2

Level3

Level4

A B C sum

a1 b1 c1 5

a1 b1 c2 10

a1 b2 c3 8

a2 b2 c1 2

a2 b2 c3 11

NotSorted

A B C sum

a2 b2 c3 11

a1 b1 c2 10

a2 b2 c1 2

a1 b1 c1 5

a1 b2 c3 8

Sorted

A B sum

a1 b1 15

a1 b2 8

a2 b2 13

Optimization5:Shared-sort

• Fromonesortedorderof(ABCD)

– Compute(ABC)– Compute(AB)– Compute(A)– Compute“all”

Page 30: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

PipeSort:Shared-sortoptimization

• BUT,mayhaveaconflictwith“smallest-parent”optimization– (ABD)->(AB)couldbeabetterchoice– Figureoutthebestparentchoicebyrunning

aweighted-matchingalgorithmlayerbylayer(detailsnotcoveredinclass) (ABCD)

(ABC)

(AB)

(A)

Page 31: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Aftercomputingthe plan,executeallpipelines

PipeSort Algorithminpicture

1.Firstpipelineisexecutedbyonescanofthedata

2.Sort(CBAD)->(BADC),computethesecondpipeline

3.…..

A,S costs

ACBD

ACB ABD ACD BDC

AC AD BC BD CD

A B C D

all

AB

Page 32: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DataMining

Page 33: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ReadingMaterialOptionalReading:

1.[RG]:Chapter26

2.“FastAlgorithmsforMiningAssociationRules”Agrawal andSrikant,VLDB1994

23,863citationsonGoogleScholarinNovember2018• 23,038inNovember2017• 20,610inNovember2016• 19,496inApril2016

OneofthemostcitedpapersinCS!

33

• Acknowledgement:Thefollowingslideshavebeenpreparedadaptingtheslidesprovidedbytheauthorsof[RG]andusingseveralpresentationsofthispaperavailableontheinternet(esp.byOfer PasternakandBrianChase)

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 34: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

FourMainStepsinKDandDM(KDD)

• DataSelection– Identifytargetsubsetofdataandattributesofinterest

• DataCleaning– Removenoiseandoutliers,unifyunits,createnewfields,usedenormalization ifneeded

• DataMining– extractinterestingpatterns

• Evaluation– presentthepatternstotheendusersinasuitableform,e.g.throughvisualization

DukeCS,Fall2018 CompSci516:DatabaseSystems 34

RememberHW1!

Page 35: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

SeveralDM/KD(Research)Problems

• Discoveryofcausalrules• Learningoflogicaldefinitions• Fittingoffunctionstodata• Clustering• Classification• Inferringfunctionaldependenciesfromdata• Finding“usefulness”or“interestingness”ofarule

– SeethecitationsintheAgarwal-Srikant paper– Somediscussedin[RG]Chapter27

DukeCS,Fall2018 CompSci516:DatabaseSystems 35

Page 36: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

• Retailerscollectandstoremassiveamountsofsalesdata– transactiondateandlistofitems

• Associationrules:– e.g.98%customerswhopurchase“tires”and“autoaccessories”alsoget“automotiveservices”done

– Customerswhobuymustardandketchupalsobuyburgers

– Goal:findtheserulesfromjusttransactionaldata(transactionid+listofitems)

MiningAssociationRules

36DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 37: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

• Canbeusedfor– marketingprogramandstrategies– cross-marketing(masse-mail,webpages)– catalogdesign– add-onsales– storelayout– customersegmentation

Applications

37DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 38: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Notations

• ItemsI={i1,i2,…,im}• D:asetoftransactions• EachtransactionT⊆ I– hasanidentifierTID

• AssociationRule– Xà Y(notFunctionalDependency!)– X,Y⊂ I– X∩Y=∅

38DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 39: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ConfidenceandSupport

• AssociationruleXàY

• Confidence c=|Tr.withXandY|/|Tr.with|X|– c%oftransactionsinDthatcontainXalsocontainY

• Support s=|Tr.withXandY|/|allTr.|– s%oftransactionsinDcontainXandY.

39DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 40: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

SupportExample

TID Cereal Beer Bread Bananas Milk

1 X X X

2 X X X X

3 X X

4 X X

5 X X

6 X X

7 X X

8 X

• Support(Cereal)• 4/8=.5

• Support(CerealàMilk)• 3/8=.375

40DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 41: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ConfidenceExample

TID Cereal Beer Bread Bananas Milk

1 X X X

2 X X X X

3 X X

4 X X

5 X X

6 X X

7 X X

8 X

• Confidence(CerealàMilk)• 3/4=.75

• Confidence(Bananasà Bread)• 1/3=.33333…

41DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 42: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Xà YisnotaFunctionalDependencyForfunctionaldependencies• F.D.=twotupleswiththesamevalueofofXmusthavethe

samevalueofY– Xà Y =>XZà Y(concatenation)– Xà Y,Yà Z=>Xà Z(transitivity)

Forassociationrules• Xà AdoesnotmeanXYàA

– Maynothavetheminimumsupport– Assumeonetransaction{AX}

• Xà AandAà ZdonotmeanXà Z– Maynothave theminimumconfidence– Assumetwotransactions{XA},{AZ}

42DukeCS,Fall2018 CompSci516:DatabaseSystems

Checkyourself

Page 43: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

ProblemDefinition

• Input– asetoftransactionsD

• Canbeinanyform– afile,relationaltable,etc.

– minsupport(minsup)– minconfidence(minconf)

• Goal:generateallassociationrulesthathave– support>=minsup and– confidence>=minconf

43DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 44: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Decompositionintotwosubproblems

• 1.Apriori– forfinding“large”itemsets withsupport>=minsup– allotheritemsets are“small”

• 2.ThenuseanotheralgorithmtofindrulesXà Ysuchthat– Bothitemsets X∪ YandXarelarge– Xà Yhasconfidence>=minconf

• Paperfocusesonsubproblem 1– ifsupportislow,confidencemaynotsaymuch– subproblem 2infullversionofthepaper

DukeCS,Fall2018 CompSci516:DatabaseSystems 44

Page 45: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

BasicIdeas- 1

• Q.Whichitemset canpossiblyhavelargersupport:ABCDorAB– i.e.whenoneisasubsetoftheother?

• Ans:AB– anysubsetofalargeitemset mustbelarge– SoifABissmall,noneedtoinvestigateABC,ABCDetc.

DukeCS,Fall2018 CompSci516:DatabaseSystems 45

Page 46: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

BasicIdeas- 2

• Startwithindividual(singleton)items{A},{B},…

• Insubsequentpasses,extendthe“largeitemsets”ofthepreviouspassas“seed”

• Generatenewpotentiallylargeitemsets (candidateitemsets)

• Thencounttheiractualsupportfromthedata

• Attheendofthepass,determinewhichofthecandidateitemsets areactuallylarge– becomesseedforthenextpass

• Continueuntilnonewlargeitemsets arefound

DukeCS,Fall2018 CompSci516:DatabaseSystems 46

Page 47: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

OptionalSlidesontheApriori Algorithm

DukeCS,Fall2018 CompSci516:DatabaseSystems 47

Page 48: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Notations

ACTUAL

POTENTIALUsedinbothApriori andAprioriTID

48

• Assumethedatabaseisoftheform<TID,i1,i2,…>whereitemsarestoredinlexicographicorder

• TID = identifierofthetransaction• Alsoworkswhenthedatabaseis“normalized”:eachdatabaserecordis<TID,item>pair

DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 49: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

AlgorithmApriori

49

!k

k

kk

t

kt

k-k

k-1

;L Answer

minsup}|c.countC { c L

c.count Cc

,t)(C C Dt

)(L C); k 2; L( k

itemsets} {large 1- L

=

³Î=

++Î

=++¹=

=

end

endend

;do candidates forall

subset begin do ons transactiforall

;genapriori-begin do For

1

1

fCountindividual itemoccurrences

Generatenewk-itemsets candidates

count=0

Findthesupportofallthecandidates

Ct = candidatescontainedint

incrementcount

Takeonlythosewithsupport>= minsup

DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 50: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Apriori-Gen

5050

l Joinstep

l Prunestep

1k1k2k2k11

1k1k

1k1k1

k

q.itemp.item,q.itemp.item,...,q.itemp.itemqp,LL

itemqitempitempp.itemC

----

--

--

<== where from

.,.,.,select intoinsert

2

k

k-1

k

c from C) L(s

ets s of c(k-1)-subs C itemsets c

deletethen if

do foralldo forall

Ï

Î

p andqaretwo large

(k-1)-itemsets identicalinallk-2firstitems.

Joinbyaddingthelastitemofqtop

Checkallthesubsets,removeallcandidatewithsome“small” subset

• TakesasargumentLk-1 (thesetofalllargek-1)-itemsets• Returnsasupersetofthesetofalllargek-itemsets byaugmentingLk-1

Lk-1⨝ Lk-1

DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 51: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Apriori-GenExample- 1

• {1,2,3}• {1,2,4}{1,3,4}

• {1,3,5}• {2,3,4}

• {1,2,3,4}

Step1:Join(k=4)

Assumenumbers1-5correspondtoindividualitems

L3 C4

51DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 52: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Apriori-GenExample- 2

• {1,2,3,4}• {1,3,4,5}

Step1:Join(k=4)

Assumenumbers1-5correspondtoindividualitems

L3 C4• {1,2,3}• {1,2,4}{1,3,4}

• {1,3,5}• {2,3,4}

52DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 53: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Apriori-GenExample- 3

• {1,2,3,4}• {1,3,4,5}

Step2:Prune(k=4)• Removeitemsets thatcan’thavethe

requiredsupportbecausethereisasubsetinitwhichdoesn’thavethelevelofsupporti.e.notinthepreviouspass(k-1)

L3 C4

No{1,4,5}existsinL3Rulesout{1,3,4,5}

• {1,2,3}• {1,2,4}{1,3,4}

• {1,3,5}• {2,3,4}

53DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 54: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

CorrectnessofApriori

54

1k1k2k2k11

1k1k

1k1k1

k

q.itemp.item,q.itemp.item,...,q.itemp.itemqp,LL

itemqitempitempp.itemC

----

--

--

<== where from

.,.,.,select intoinsert

2

k

k-1

k

c from C) L(s

ets s of c(k-1)-subs C itemsets c

deletethen if

do foralldo forall

Ï

Î

ShowthatCk⊇ Lk• Anysubsetoflargeitemset mustalsobe

large

• foreachpinLk,ithasasubsetqinLk-1• WeareextendingthosesubsetsqinJoin

withanothersubsetq’ofp,whichmustalsobelarge

• equivalenttoextendingLk-1 withallitemsandremovingthose

whose(k-1)subsetsarenotinLk-1• PruneisnotdeletinganythingfromLk

prune

join

Checkyourself

DukeCS,Fall2018 CompSci516:DatabaseSystems

OptionalSlide

Page 55: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Conclusions(of516,Fall2018)

DukeCS,Fall2018 CompSci516:DatabaseSystems 55

Page 56: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Take-Aways

• DBMSBasics

• DBMSInternals

• OverviewofResearchAreas

• Hands-onExperienceinDBsystems

DukeCS,Fall2018 CompSci516:DatabaseSystems 56

Page 57: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DBSystems• TraditionalDBMS– PostGres,SQL

• Large-scaleDataProcessingSystems– Spark/Scala,AWS

• NewDBMS/NOSQL–MongoDB

• Inaddition– XML,JSON,JDBC,Python/Java

DukeCS,Fall2018 CompSci516:DatabaseSystems 57

Page 58: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DBBasics

• SQL• RA/LogicalPlans• RC• Datalog–Whyweneededeachoftheselanguages

• NormalForms

DukeCS,Fall2018 CompSci516:DatabaseSystems 58

Page 59: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

DBInternalsandAlgorithms

• Storage• Indexing• OperatorAlgorithms– ExternalSort– JoinAlgorithms

• Cost-basedQueryOptimization• Transactions– ConcurrencyControl– Recovery

DukeCS,Fall2018 CompSci516:DatabaseSystems 59

Page 60: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

Large-scaleProcessingandNewApproaches

• ParallelDBMS• DistributedDBMS• MapReduce• NOSQL

DukeCS,Fall2018 CompSci516:DatabaseSystems 60

Page 61: Data Cube and Data Mining - Duke University · 2018. 11. 27. · •big “Petabyte scale”customers •Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data

OtherTopics

• DataWarehouse/OLAP/DataCube• AssociationRuleMining

• HopesomeofyouwillfurtherexploreDatabaseSystems/DataManagement/DataAnalysis/BigData!

DukeCS,Fall2018 CompSci516:DatabaseSystems 61