34
CS122 – Lecture 5 Winter Term, 2017-2018

CS122 –Lecture 5 Winter Term, 2017-2018courses.cms.caltech.edu/cs122/lectures/CS122Lec05.pdf · 2018-01-20 · table name and column name ... USING and NATURAL joins are more complicated

  • Upload
    ngominh

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

CS122– Lecture5WinterTerm,2017-2018

LastTime:SQLJoinExpressions� Lasttime,begandiscussingSQLjoinsyntax� OriginalSQLform:

� SELECT…FROMt1,t2,…WHEREP� AnyjoinconditionsarespecifiedinWHEREclause

� FROMclauseproducesaCartesianproductoft1,t2,…� t1× t2× …� SchemaproducedbyFROMclauseist1.*È t2.*È…

� ANSI-standardSQL:WHEREclausemayonlyrefertothecolumnsgeneratedbytheFROMclause� AliasesinSELECTclauseshouldn’tbevisible(althoughmanydatabasesmakethemvisibleinWHEREclause)

2

SQLJoinExpressions(2)� SELECT…FROMt1,t2,…WHEREP

� t1× t2× …� SchemaofFROMclauseist1.*È t2.*È…(inthatorder)

� Toavoidambiguity,columnnamesinschemaalsoincludecorrespondingtablenames,e.g.t1.a,t1.b,t2.a,t2.c,etc.� Ifcolumnnameisunambiguous,predicatecanjustusecolumnnamebyitself

� Ifcolumnnameisambiguous,predicatemustspecifybothtablenameandcolumnname

� Example:SELECT*FROMt1,t2WHEREa>5ANDc =20;� Notvalid:columnnameaisambiguous(givenaboveschema)

� Valid:SELECT*FROMt1,t2WHEREt1.a>5ANDc =20;

3

AdditionalSQLJoinSyntax� SQL-92introducedseveralnewforms:

� SELECT…FROMt1JOINt2ONt1.a=t2.a� SELECT…FROMt1JOINt2USING(a1,a2,…)� SELECT…FROMt1NATURALJOINt2� CanspecifyINNER,[LEFT|RIGHT|FULL]OUTERJOIN

� AlsoCROSSJOIN,butcannotspecifyON,USING,orNATURAL

� ONclauseisnotthatchallenging� Similartooriginalsyntax,butallowsinner/outerjoins� Schemaof“FROMt1JOINt2ON…”ist1.*È t2.*

4

AdditionalSQLJoinSyntax(2)� USINGandNATURALjoinsaremorecomplicated

� SELECT…FROMt1JOINt2USING(a1,a2,…)� SELECT…FROMt1NATURALJOINt2� Joinconditionisinferredfromthecommoncolumnnames(NATURALJOIN),orgeneratedfromtheUSINGclause

� Alsoincludesaprojecttoeliminateduplicatecolumnnames(projectispartoftheFROMclause;affectsWHEREpredicate)

� ForSELECT*FROMt1NATURALJOINt2,orSELECT*FROMt1JOINt2USING(a1,a2,…):� DenotethejoincolumnsasJC.Thesehavenotablename.

� Fornaturaljoin,JC=t1Ç t2;otherwise,JC=attrs inUSINGclause� FROMclause’sschemaisJCÈ (t1– JC)È (t2– JC)

5

AdditionalSQLJoinSyntax(3)� ForSELECT*FROMt1NATURAL[???] JOINt2:

� Schemas:t1(a,b)andt2(a,c)� FROMschema:(a,t1.b,t2.c)

� Fornaturalinnerjoin:� Projectcanuseeithert1.a ort2.a togeneratea

� Fornaturalleftouterjoin:� Projectshoulduset1.a;t2.amaybeNULLforsomerows� (Similarfornaturalrightouterjoin,exceptt2.a isused)

� Fornaturalfullouterjoin:� ProjectshoulduseCOALESCE(t1.a,t2.a),sinceeithert1.aort2.a couldbeNULL

6

AdditionalSQLJoinSyntax(4)� SELECTt1.aFROMt1NATURALJOINt2

� Schemas:t1(a,b)andt2(a,c)� FROMschema:(a,t1.b,t2.c)

� ThisqueryisnotvalidundertheANSIstandard,becausethereisnot1.aoutsidetheFROMclause� Somedatabases(e.g.MySQL)willallowthisquery

� Thisqueryisvalid:� SELECTa,t2.cFROMt1NATURALJOINt2� (Technically,canalsosay“SELECTa,c”becausecwon’tbeambiguous)

7

AdditionalSQLJoinSyntax(5)� SELECT*FROMt1NATURALJOINt2NATURALJOINt3

� Schemas:t1(a,b),t2(a,c),t3(a,d)� FROMschema:(a,t1.b,t2.c,t3.d)

� Thisquerypresentsanotherchallenge� Step1:t1NATURALJOINt2

� Joinconditionis:t1.a =t2.a� Resultschemais(a,t1.b,t2.c)

� Step2:natural-jointhisresultwitht3� Joinconditionis:a =t3.a� Problem:column-referencea isambiguous

8

AdditionalSQLJoinSyntax(6)� SELECT*FROMt1NATURALJOINt2NATURALJOINt3

� Schemas:t1(a,b),t2(a,c),t3(a,d)� FROMschema:(a,t1.b,t2.c,t3.d)

� Generateplaceholdertablenamestoavoidambiguities� Step1(revised):t1NATURALJOINt2

� Joinconditionis:t1.a =t2.a� Resultschemais#R1(a,t1.b,t2.c)

� Step2(revised):natural-jointhisresultwitht3� Joinconditionis:#R1.a =t3.a� Resultschemais#R2(a,t1.b,t2.c,t3.d)

9

MappingSQLJoinsintoPlans� Summary:translatingSQLjoinshasitsownchallenges� Primarilycenteraroundnaturaljoins,andjoinswiththeUSINGclause:� Mustgenerateanappropriateschematoeliminateduplicatecolumns

� MustuseCOALESCE()operationsonjoin-columnsusedinfullouterjoins

� Mayneedtodealwithambiguouscolumnnameswhenmorethantwotablesarenatural-joinedtogether

� (Allsurmountable;justannoying…)

10

NestedSubqueries� SQLqueriescanalsoincludenestedsubqueries� SubqueriescanappearintheSELECTclause:

� SELECTcustomer_id,(SELECTSUM(balance)FROMloanJOINborrowerbWHEREb.customer_id=c.customer_id)tot_bal

FROMcustomerc;� (Computetotalofeachcustomer’sloanbalances)

� Mustbeascalarsubquery� Mustproduceexactlyonerowandonecolumn

� Thisisalmostalwaysacorrelatedsubquery� Innerqueryreferstoanenclosingquery’svalues� Requirescorrelatedevaluationtocomputetheresults

11

NestedSubqueries(2)� SubqueriescanalsoappearintheFROMclause:

� SELECTu.username,email,max_scoreFROMusersu,

(SELECTusername,MAX(score)ASmax_scoreFROMgame_scoresGROUPBYusername)ASs

WHEREu.username=s.username;� Calledaderivedrelation

� Thetableisproducedbyasubquery,insteadofbeingreadfromafile(a.k.a.abaserelation)

� Cannotbeacorrelatedsubquery� …atleast,notwithrespecttotheimmediatelyenclosingquery� Couldstillbecorrelatedwithaqueryfurtherout,ifparentappearsinaSELECTexpression,oraWHEREpredicate,etc.

12

NestedSubqueries(3)� SubqueriescanalsoappearintheWHEREclause:

� SELECTemployee_id,last_name,first_nameFROMemployeeseWHEREe.is_manager=0ANDEXISTS(SELECT*FROMemployeesm

WHEREm.department=e.departmentANDm.is_manager=1ANDm.salary<e.salary);

� (Findnon-manageremployeeswhomakemoremoneythansomemanagerinthesamedepartment)

� Also,IN/NOTINoperators,ANY/SOME/ALLqueries,andscalarsubqueriesaswell

� Again,couldbeacorrelatedsubquery,andoftenis.L

13

NestedSubqueries(4)� Previousexample:

� SELECTemployee_id,last_name,first_nameFROMemployeeseWHEREis_manager=0ANDEXISTS(SELECT*FROMemployeesm

WHEREm.department=e.departmentANDm.is_manager=1ANDm.salary<e.salary);

� NotethatEXISTS/NOTEXISTScancompleteafteronlyonerowisgeneratedbysubquery� Don’tneedtoevaluateentiresubqueryresult…� Definitelywanttooptimizesubplantoproducefirstrowasquicklyaspossible

14

SubqueriesinFROMClause� FROMsubqueries aretheeasiesttodealwith!J

� SELECTu.username,email,max_scoreFROMusersu,

(SELECTusername,MAX(score)ASmax_scoreFROMgame_scoresGROUPBYusername)ASs

WHEREu.username=s.username;� Togenerateexecutionplanforfullquery:

� Simplygenerateexecutionplanforthederivedrelation(e.g.recursivecalltoplannerwithsubquery’sAST)

� Usethesubquery’splanasaninputintotheouterquery(asifitwereanothertableintheFROMclause)

15

SubqueriesinFROMClause(2)� Ourexample:

� SELECTu.username,email,max_scoreFROMusersu,

(SELECTusername,MAX(score)ASmax_scoreFROMgame_scoresGROUPBYusername)ASs

WHEREu.username=s.username;� Subqueryplan:

game_scores

G usernameMAX(score)

users

θ

Πu.username,u.email,s.max_score

u.username =s.username

game_scores

G usernameMAX(score)

� Fullplan:

16

FROMSubqueriesandViews� Viewswillalsocreatesubqueries intheFROMclause

� CREATEVIEWtop_scoresASSELECTusername,MAX(score)ASmax_scoreFROMgame_scoresGROUPBYusername;

� SELECTu.username,email,max_scoreFROMusersu,top_scoressWHEREu.username=s.username;

� Simplesubstitutionofview’sdefinitioncreatesanestedsubquery intheFROMclause:� SELECTu.username,email,max_scoreFROMusersu,(SELECTusername,MAX(score)ASmax_score

FROMgame_scores GROUPBYusername)sWHEREu.username =s.username;

17

FROMSubqueriesandViews(2)� Twooptionsastohowthisisdone� Option1:

� Whenviewiscreated,databasecanconstructarelationalalgebraplanfortheview,andsaveit.

� Whenaqueryreferencestheview,simplyusetheview’splanasasubplaninthereferencingquery.

� Option2:� Whenviewiscreated,databaseparsesandverifiestheSQL,butdoesn’tgeneratearelationalalgebraplan.

� Whenaqueryreferencestheview,modifythequery’sSQLtousetheview’sdefinition,thengenerateaplan.

� Secondoptionrequiresmoreworkduringplanning,butpotentiallyallowsforgreateroptimizationstobeapplied

18

SubqueriesinSELECTClause� SubqueriesintheSELECTclausemustbescalarsubqueries:

� SELECTcustomer_id,(SELECTSUM(balance)FROMloanJOINborrowerbWHEREb.customer_id =c.customer_id)tot_bal

FROMcustomerc;� Mustproduceexactlyonerowandonecolumn

� Aneasy,generallyusefulapproach:� Representscalarsubquery asspecialkindofexpression� Duringqueryplanning,generateaplanforthesubquery� Whenselect-expressionisevaluated,recursivelyinvokethequeryexecutortoevaluatethesubquery togeneratearesult

� (Reportanerrorifdoesn’tproduceexactlyonerow/column!)

19

SubqueriesinSELECTClause(2)� SubqueriesintheSELECTclausemustbescalarsubqueries:

� SELECTcustomer_id,(SELECTSUM(balance)FROMloanJOINborrowerbWHEREb.customer_id =c.customer_id)tot_bal

FROMcustomerc;� Mustproduceexactlyonerowandonecolumn

� Ifscalarsubquery iscorrelated:� Mustreevaluatethesubquery foreachrowinouterquery

� Ifscalarsubquery isn’tcorrelated:� Canevaluatesubqueryonceandcachetheresult� (Thisisanoptimization;correlatedevaluationwillalsowork,althoughitisobviouslyunnecessarilyslow.)

20

SubqueriesinSELECTClause(3)� Correlatedscalarsubqueries intheSELECTclausecanfrequentlyberestatedasadecorrelated outerjoin:� SELECTcustomer_id,

(SELECTSUM(balance)FROMloanJOINborrowerbWHEREb.customer_id =c.customer_id)tot_bal

FROMcustomerc;� Equivalentto:

� SELECTc.customer_id,tot_balFROMcustomercLEFTOUTERJOIN

(SELECTb.customer_id,SUM(balance)tot_balFROMloanJOINborrowerbGROUPBYb.customer_id)tONt.customer_id =c.customer_id);

� Usually,outerjoinischeaperthancorrelatedevaluation

21

ScalarSubqueriesinOtherClauses� Scalarsubqueries canalsoappearinotherpredicates,e.g.WHEREclauses,HAVINGclauses,ONclauses,etc.

� Thesecasesaremorelikelytobeuncorrelated,whichmeanstheycanbeevaluatedonceandthencached

� Iftheyarecorrelated,theycanalsooftenberestatedasajoininanappropriatepartoftheexecutionplan� But,itcangetsignificantlymorecomplicated…

22

SubqueriesinWHEREClause� IN/NOTINclausesandEXISTS/NOTEXISTSpredicatescanalsoappearinWHEREandHAVINGclauses

� Example:FindbankcustomerswithaccountsatanybankbranchinLosAngeles� SELECT*FROMcustomercWHEREcustomer_id IN

(SELECTcustomer_id FROMdepositorNATURALJOINaccountNATURALJOINbranchWHEREbranch_city ='LosAngeles');

� Isthisquerycorrelated?� No;innerquerydoesn’treferenceenclosingqueryvalues

23

SubqueriesinWHEREClause(2)� Again,canimplementIN/EXISTSinasimpleandgenerallyusefulway:� CreatespecialINandEXISTSexpressionoperatorsthatincludeasubquery

� Duringplanning,anexecutionplanisgeneratedforeachsubquery inanINorEXISTSexpression

� WhenINorEXISTSexpressionisevaluated,recursivelyinvoketheexecutortoevaluatesubquery andtestrequiredcondition� e.g.INscansthegeneratedresultsfortheLHSvalue� e.g.EXISTSreturnstrueifarowisgeneratedbysubquery,orfalseifnorowsaregeneratedbythesubquery

24

SubqueriesinWHEREClause(3)� IN/NOTINclausesandEXISTS/NOTEXISTSpredicatescanalsobecorrelated� EXISTS/NOTEXISTSsubqueries arealmostalwayscorrelated

� Ifsubquery isnotcorrelated,canmaterializesubqueryresultsandreusethem� …buttheymaybelarge;wemaystillendupbeingverrry slow

� Previousapproachisn’tanywherenearideal� INoperatoreffectivelyimplementsajoinoperation,butwithoutanyoptimizations

� EXISTSisabitfaster,butsubquery isfrequentlycorrelated� Wouldgreatlyprefertoevaluatesubquery usingjoins,particularlyifwecaneliminatecorrelatedevaluation!

25

Semijoin andAntijoin� TwousefulrelationalalgebraoperationsinthecontextofIN/NOTINandEXISTS/NOTEXISTSqueries

� Relationsr(R)ands(S)� Thesemijoin r s isthecollectionofallrowsinr thatcanjoinwithsomecorrespondingrowins� {tr |tr Î r Ù $ ts Î s (join(tr,ts))}� join(tr,ts) isthejoincondition

� r s equivalenttoΠR(r s),butonlywithsets oftuples� Ifr ands aremultisets,theseexpressionsarenotequivalent,sinceatupleinr thatmatchesmultipletuplesinswillbecomeduplicatedinthenaturaljoin’sresult

26

Semijoin andAntijoin (2)� Theantijoin r s isthecollectionofallrowsinr thatdon’tjoinwithsomecorrespondingrowins� {tr |tr Î r Ù ¬$ ts Î s (join(tr,ts))}

� Alsocalledanti-semijoin,sincer s isequivalenttor – r s (isthecomplementof)

� Bothsemijoin andantijoin operationsareeasytocomputewithourvariousjoinalgorithms� Canincorporateintotheta-joinimplementationseasily

� CanusetheseoperationstorestatemanyIN/NOTINandEXISTS/NOTEXISTSqueries

27

ExampleINSubquery� Findallbankcustomerswhohaveanaccountatanybankbranchinthecitytheylivein� SELECT*FROMcustomercWHEREc.customer_city IN

(SELECTb.branch_cityFROMbranchbNATURALJOINaccounta

NATURALJOINdepositordWHEREd.customer_id =c.customer_id);

� Recall:brancheshaveabranch_name andabranch_city� Innerqueryisclearlycorrelatedwithouterquery� Naïvecorrelatedevaluationwouldbevery slowL

� Jointhreetablesininnerqueryforeverybankcustomer!

28

ExampleINSubquery (2)� Examplequery:

� SELECT*FROMcustomercWHEREc.customer_city IN(SELECTb.branch_cityFROMbranchbNATURALJOINaccounta

NATURALJOINdepositordWHEREd.customer_id =c.customer_id);

� Candecorrelate byextractinginnerquery,modifyingittofindallbranchesforallcustomers,inoneshot:� SELECT*FROMbranchbNATURALJOINaccounta

NATURALJOINdepositord� Includestuplesforeachbranchthateachcustomerhasaccountsat

29

ExampleINSubquery (3)� Couldtakeourinnerqueryandjoinitagainstcustomer

� SELECTc.*FROMcustomercJOIN(SELECT*FROMbranchbNATURALJOINaccounta

NATURALJOINdepositord)AStON(t.customer_id =c.customer_id AND

c.customer_city =t.branch_city);� Problems?

� Ifacustomerhasmultipleaccountsatlocalbranches,thecustomerwillappearmultipletimesintheresult

� Cause:theoutermostjoinwillduplicatecustomerrowsforeachmatchingrowinnestedquery

� Solution:useasemijoin tojoincustomerstothesubquery

30

ExampleINSubquery (4)� Ouroriginalcorrelatedquery:

� SELECT*FROMcustomercWHEREc.customer_city IN(SELECTb.branch_cityFROMbranchbNATURALJOINaccounta

NATURALJOINdepositordWHEREd.customer_id =c.customer_id);

� Thedecorrelated query:� SELECT*FROMcustomercSEMIJOIN

(SELECT*FROMbranchbNATURALJOINaccountaNATURALJOINdepositord)ASt

ON(t.customer_id =c.customer_id ANDc.customer_city =t.branch_city);

� (Note:writingasemijoin inSQLisn’twidelysupported…)

31

ExampleNOTEXISTSSubquery� Asimplerquery:findcustomerswhohavenobankbranchesintheirhomecity� SELECT*FROMcustomercWHERENOTEXISTS(SELECT*FROMbranchb

WHEREb.branch_city =c.customer_city);� Again,thisqueryrequirescorrelatedevaluation

� Notasbadaspreviousquery,sinceNOTEXISTSonlyhastoproduceonerowfrominnerquery,notalltherows…

� Ifthere’sanindexonbranch_city,thiswon’tbehorriblyslow,butagain,weareimplementingajoinhere

� (Wehavefastequijoinalgorithms;whynotusethem?)

32

ExampleNOTEXISTSSubquery (2)� Examplequery:

� SELECT*FROMcustomercWHERENOTEXISTS(SELECT*FROMbranchb

WHEREb.branch_city =c.customer_city);� Thisqueryisveryeasytowritewithanantijoin:

� SELECT*FROMcustomercANTIJOIN branchbONbranch_city =customer_city;

� Couldalsowritewithanouterjoin:� SELECTc.*FROMcustomercLEFTJOINbranchb

ONbranch_city =customer_cityWHEREbranch_city ISNULL;

� Thisapproachwon’tcreateduplicatesofcustomers,likeourpreviousINexamplewouldhave…

33

Summary:NestedSubqueries� Onlyscratchedthesurfaceofsubquery translationandoptimization� Anincrediblyrichtopic– tonsofinterestingresearch!

� Canusebasictoolswediscussedtodaytodecorrelateandoptimizeaprettybroadrangeofsubqueries� Outerjoins,sometimesagainstgroup/aggregateresults� Semijoins andantijoins forset-membershipsubqueries

� Animportantquestion,notconsideredfornow:� Isthetranslatedversionactuallyfaster?(Orwhenmultipleoptions,whichoptionisfastest?)

� Aplanner/optimizermustmakethatdecision

34