99
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #5: Hashing and Sor5ng

CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

CS  5614:  (Big)  Data  Management  Systems  

B.  Aditya  Prakash  Lecture  #5:  Hashing  and  Sor5ng  

Page 2: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

(Sta7c)  Hashing  

§  Problem:  “find  EMP  record  with  ssn=123”  § What  if  disk  space  was  free,  and  5me  was  at  premium?  

Prakash  2014   VT  CS  5614   2  

Page 3: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  

§  A:  Brilliant  idea:  key-­‐to-­‐address  transforma5on:  

Prakash  2014   VT  CS  5614   3  

#0  page  

#123  page  

#999,999,999  

123;  Smith;  Main  str  

Page 4: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  

§  Since  space  is  NOT  free:  §  use  M,  instead  of  999,999,999  slots  §  hash  func5on:  h(key)  =  slot-­‐id  

Prakash  2014   VT  CS  5614   4  

#0  page  

#123  page  

#999,999,999  

123;  Smith;  Main  str  

Page 5: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  

§  Typically:  each  hash  bucket  is  a  page,  holding  many  records:  

Prakash  2014   VT  CS  5614   5  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str  

Page 6: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  

§  No5ce:  could  have  clustering,  or  non-­‐clustering  versions:  

Prakash  2014   VT  CS  5614   6  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str.  

Page 7: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  

§  No5ce:  could  have  clustering,  or  non-­‐clustering  versions:  

Prakash  2014   VT  CS  5614   7  

123  ...  

#0  page  

#h(123)  

M  

...  EMP  file  

123;  Smith;  Main  str.  

...  

234;  Johnson;  Forbes  ave  

345;  Tompson;  Fi]h  ave  

...  

Page 8: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Design  decisions  

§  1)  formula  h()  for  hashing  func5on  §  2)  size  of  hash  table  M  §  3)  collision  resolu5on  method  

Prakash  2014   VT  CS  5614   8  

Page 9: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Design  decisions  -­‐  func7ons  

§  Goal:  uniform  spread  of  keys  over  hash  buckets  

§  Popular  choices:  – Division  hashing  – Mul5plica5on  hashing  

Prakash  2014   VT  CS  5614   9  

Page 10: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Division  hashing  

§  h(x)  =  (a*x+b)    mod    M  §  eg.,  h(ssn)  =  (ssn)  mod  1,000  

– gives  the  last  three  digits  of  ssn  § M:  size  of  hash  table  -­‐  choose  a  prime  number,  defensively  (why?)  

Prakash  2014   VT  CS  5614   10  

Page 11: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Division  hashing  

Prakash  2014   VT  CS  5614   11  

§  eg.,  M=2;  hash  on  driver-­‐license  number  (dln),  where  last  digit  is  ‘gender’  (0/1  =  M/F)  

§  in  an  army  unit  with  predominantly  male  soldiers  

§  Thus:  avoid  cases  where  M  and  keys  have  common  divisors  -­‐  prime  M  guards  against  that!  

Page 12: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Mul7plica7on  hashing  

Prakash  2014   VT  CS  5614   12  

h(x)  =  [  frac-onal-­‐part-­‐of  (  x  *  φ  )  ]  *  M  

§  φ:  golden  ra5o  (  0.618...  =  (  sqrt(5)-­‐1)/2  )    

§  in  general,  we  need  an  irra5onal  number  

§  advantage:  M  need  not  be  a  prime  number  

§  but  φ  must  be  irra5onal  

Page 13: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Other  hashing  func7ons  

Prakash  2014   VT  CS  5614   13  

§  quadra5c  hashing  (bad)  §  ...  

Page 14: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Other  hashing  func7ons  

Prakash  2014   VT  CS  5614   14  

§  quadra5c  hashing  (bad)  §  ...  

§  conclusion:  use  division  hashing  

Page 15: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Size  of  hash  table  

Prakash  2014   VT  CS  5614   15  

§  eg.,  50,000  employees,  10  employee-­‐records    /  page  

§  Q:  M=??  pages/buckets/slots  

Page 16: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Size  of  hash  table  

Prakash  2014   VT  CS  5614   16  

§  eg.,  50,000  employees,  10  employees/page  

§  Q:  M=??  pages/buckets/slots  

§  A:  u5liza5on  ~  90%  and  – M:  prime  number  

Eg.,  in  our  case:  M=  closest  prime  to  50,000/10  /  0.9  =  5,555    

Page 17: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

§  Q:  what  is  a  ‘collision’?  §  A:  ??  

Prakash  2014   VT  CS  5614   17  

Page 18: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

Prakash  2014   VT  CS  5614   18  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str.  

Page 19: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

§  Q:  what  is  a  ‘collision’?  §  A:  ??  §  Q:  why  worry  about  collisions/overflows?  (recall  that  buckets  are  ~90%  full)  

§  A:  ‘birthday  paradox’  

Prakash  2014   VT  CS  5614   19  

Page 20: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

§  open  addressing  –  linear  probing  (ie.,  put  to  next  slot/bucket)  –  re-­‐hashing  

§  separate  chaining  (ie.,  put  links  to  overflow  pages)  

Prakash  2014   VT  CS  5614   20  

Page 21: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

Prakash  2014   VT  CS  5614   21  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str.  

linear  probing:  

Page 22: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

Prakash  2014   VT  CS  5614   22  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str.  

re-­‐hashing  

h1()  

h2()  

Page 23: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Collision  resolu7on  

Prakash  2014   VT  CS  5614   23  

123;  Smith;  Main  str.  

separate  chaining  

Page 24: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Design  decisions  -­‐  conclusions  

§  func5on:  division  hashing    – h(x)  =  (  a*x+b  )  mod  M  

§  size  M:  ~90%  u5l.;  prime  number.  §  collision  resolu5on:  separate  chaining    

– easier  to  implement  (dele5ons!);  –   no  danger  of  becoming  full  

Prakash  2014   VT  CS  5614   24  

Page 25: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Problem  with  sta7c  hashing  

§  problem:  overflow?  §  problem:  underflow?  (underu5liza5on)  

Prakash  2014   VT  CS  5614   25  

Page 26: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Solu7on:  Dynamic/extendible  hashing  

§  idea:  shrink  /  expand  hash  table  on  demand..  §  ..dynamic  hashing  §  Details:  how  to  grow  gracefully,  on  overflow?  § Many  solu5ons  -­‐  One  of  them:  ‘extendible  hashing’  [Fagin  et  al]  

Prakash  2014   VT  CS  5614   26  

Page 27: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   27  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str.  

Page 28: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   28  

#0  page  

#h(123)  

M  

123;  Smith;  Main  str.  

solu7on:    

split  the  bucket  in  two  

Page 29: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   29  

in  detail:  §  keep  a  directory,  with  ptrs  to  hash-­‐buckets  §  Q:  how  to  divide  contents  of  bucket  in  two?  §  A:  hash  each  key  into  a  very  long  bit  string;  keep  only  as  many  bits  as  needed  

Eventually:  

Page 30: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   30  

directory  

00...  01...  

10...  

11...  

10101...  

10110...  

1101...  

10011...  

0111...  0001...  

101001...  

Page 31: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   31  

directory  

00...  01...  

10...  

11...  

10101...  

10110...  

1101...  

10011...  

0111...  0001...  

101001...  

Page 32: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   32  

directory  

00...  01...  

10...  

11...  

10101...  

10110...  

1101...  

10011...  

0111...  0001...  

101001...  

split  on  3-­‐rd  bit  

Page 33: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   33  

directory  

00...  01...  

10...  

11...  

1101...  

10011...  

0111...  0001...  

101001...  10101...  

10110...  

new  page  /  bucket  

Page 34: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   34  

directory  (doubled)  

1101...  

10011...  

0111...  0001...  

101001...  10101...  

10110...  

new  page  /  bucket  

000...  001...  

010...  

011...  

100...  101...  

110...  

111...  

Page 35: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

Prakash  2014   VT  CS  5614   35  

00...  01...  

10...  

11...  

10101...  

10110...  

1101...  

10011...  

0111...  0001...  

101001...  

000...  001...  

010...  

011...  

100...  101...  

110...  

111...  

1101...  

10011...  

0111...  0001...  

101001...  10101...  

10110...  

BEFORE   AFTER  

Page 36: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Extendible  hashing  

§  Summary:  directory  doubles  on  demand  §  or  halves,  on  shrinking  files  §  needs  ‘local’  and  ‘global’  depth  

Prakash  2014   VT  CS  5614   36  

Page 37: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  overview  

§ Mo5va5on  §  main  idea  §  search  algo  §  inser5on/split  algo  §  dele5on  

Prakash  2014   VT  CS  5614   37  

Page 38: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

§ Mo5va5on:  ext.  hashing  needs  directory  etc  etc;  which  doubles  (ouch!)  

§  Q:  can  we  do  something  simpler,  with  smoother  growth?  

Prakash  2014   VT  CS  5614   38  

Page 39: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

§ Mo5va5on:  ext.  hashing  needs  directory  etc  etc;  which  doubles  (ouch!)  

§  Q:  can  we  do  something  simpler,  with  smoother  growth?  

§  A:  split  buckets  from  le]  to  right,  regardless  of  which  one  overflowed  (‘crazy’,  but  it  works  well!)  -­‐  Eg.:  

Prakash  2014   VT  CS  5614   39  

Page 40: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

Prakash  2014   VT  CS  5614   40  

Ini5ally:  h(x)  =  x  mod  N        (N=4  here)  

Assume  capacity:  3  records  /  bucket  

Insert  key  ‘17’  

0   1   2   3  bucket-­‐  id  

4        8   5        9  13  

6   7      11  

Page 41: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

Prakash  2014   VT  CS  5614   41  

Ini5ally:  h(x)  =  x  mod  N        (N=4  here)  

0   1   2   3  bucket-­‐  id  

4        8   5        9  13  

6   7      11  

17  overflow  of  bucket#1  

Page 42: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

Prakash  2014   VT  CS  5614   42  

Ini5ally:  h(x)  =  x    mod  N        (N=4  here)  

0   1   2   3  bucket-­‐  id  

4        8   5        9  13  

6   7      11  

17  overflow  of  bucket#1  

Split  #0,  anyway!!!  

Page 43: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

Prakash  2014   VT  CS  5614   43  

Ini5ally:  h(x)  =  x  mod  N        (N=4  here)  

0   1   2   3  bucket-­‐  id  

4        8   5        9  13  

6   7      11  

17  Split  #0,  anyway!!!  

Q:  But,  how?  

Page 44: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  

Prakash  2014   VT  CS  5614   44  

A:  use  two  h.f.:    h0(x)  =  x  mod  N  

                                                     h1(x)  =  x  mod  (2*N)  

0   1   2   3  bucket-­‐  id  

4        8   5        9  13  

6   7      11  

17  

Page 45: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  a]er  split:  

Prakash  2014   VT  CS  5614   45  

A:  use  two  h.f.:    h0(x)  =  x  mod  N  

                                                     h1(x)  =  x  mod  (2*N)  

0   1   2   3  bucket-­‐  id  

8     5        9  13  

6   7      11  

17  

4  

4  

Page 46: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  a]er  split:  

Prakash  2014   VT  CS  5614   46  

A:  use  two  h.f.:    h0(x)  =  x  mod  N  

                                                     h1(x)  =  x  mod  (2*N)  

0   1   2   3  bucket-­‐  id  

8     5        9  13  

6   7      11  

17  

4  

overflow  

4  

Page 47: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  a]er  split:  

Prakash  2014   VT  CS  5614   47  

A:  use  two  h.f.:    h0(x)  =  x  mod  N  

                                                     h1(x)  =  x  mod  (2*N)  

0   1   2   3  bucket-­‐  id  

8     5        9  13  

6   7      11  

17  

4  

overflow  

4  

split  ptr  

Page 48: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  searching?  

Prakash  2014   VT  CS  5614   48  

h0(x)  =    x  mod  N                (for  the  un-­‐split  buckets)  h1(x)  =    x  mod  (2*N)  (for  the  spliEed  ones)  

0   1   2   3  bucket-­‐  id  

8     5        9  13  

6   7      11  

17  

4  

overflow  

4  

split  ptr  

Page 49: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  searching?  

Prakash  2014   VT  CS  5614   49  

Q1:  find  key  ‘6’?                  Q2:  find  key  ‘4’?    

Q3:  key  ‘8’?  

0   1   2   3  bucket-­‐  id  

8     5        9  13  

6   7      11  

17  

4  

overflow  

4  

split  ptr  

Page 50: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  searching?  

Prakash  2014   VT  CS  5614   50  

Algo  to  find  key  ‘k’:  

•   compute  b=  h0(k);    

•   if    b<split-­‐ptr,  compute  b=h1(k)  

•   search  bucket  b  

Page 51: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  inser7on?  

Prakash  2014   VT  CS  5614   51  

Algo:  insert  key  ‘k’  

•   compute  appropriate  bucket  ‘b’  

•   if  the  overflow  criterion  is  true  • split  the  bucket  of  ‘split-­‐ptr’  •   split-­‐ptr  ++  (*)  

Page 52: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  inser7on?  

Prakash  2014   VT  CS  5614   52  

§  no5ce:  overflow  criterion  is  up  to  us!!  §  Q:  sugges5ons?  

Page 53: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  inser7on?  

Prakash  2014   VT  CS  5614   53  

§  no5ce:  overflow  criterion  is  up  to  us!!  §  Q:  sugges5ons?  §  A1:  space  u5liza5on  >=  u-­‐max  

Page 54: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  inser7on?  

Prakash  2014   VT  CS  5614   54  

§  no5ce:  overflow  criterion  is  up  to  us!!  §  Q:  sugges5ons?  §  A1:  space  u5liza5on  >  u-­‐max  §  A2:  avg  length  of  ovf  chains  >  max-­‐len  §  A3:  ....  

Page 55: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  inser7on?  

Prakash  2014   VT  CS  5614   55  

Algo:  insert  key  ‘k’  

•   compute  appropriate  bucket  ‘b’  

•   if  the  overflow  criterion  is  true  • split  the  bucket  of  ‘split-­‐ptr’  •   split-­‐ptr  ++  (*)  

what  if  we  reach  the  right  edge??  

Page 56: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  split  now?  

Prakash  2014   VT  CS  5614   56  

h0(x)  =  x  mod  N              (for  the  un-­‐split  buckets)  h1(x)  =    x  mod  (2*N)    for  the  spliEed  ones)  

split  ptr  

0   1   2   3   4   5   6  

Page 57: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  split  now?  

Prakash  2014   VT  CS  5614   57  

h0(x)  =    x  mod  N                (for  the  un-­‐split  buckets)  h1(x)  =    x  mod  (2*N)    (for  the  spliEed  ones)  

split  ptr  

0   1   2   3   4   5   6   7  

   

Page 58: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  split  now?  

Prakash  2014   VT  CS  5614   58  

h0(x)  =  x  mod  N                (for  the  un-­‐split  buckets)  h1(x)  =  x  mod  (2*N)    (for  the  spliEed  ones)  

split  ptr  

0   1   2   3   4   5   6   7  

   

Page 59: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  split  now?  

Prakash  2014   VT  CS  5614   59  

h0(x)  =  x  mod  N                  (for  the  un-­‐split  buckets)  h1(x)  =  x  mod  (2*N)    (for  the  spliEed  ones)  

split  ptr  

0   1   2   3   4   5   6   7  

Page 60: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  split  now?  

Prakash  2014   VT  CS  5614   60  

split  ptr  

0   1   2   3   4   5   6   7  

this  state  is  called  ‘full  expansion’  

Page 61: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  observa7ons  

Prakash  2014   VT  CS  5614   61  

In  general,  at  any  point  of  5me,  we  have  at  most  two  h.f.  ac5ve,  of  the  form:  

• hn(x)  =    x  mod  (N  *  2n)          

• hn+1(x)  =  x  mod  (N  *  2n+1)  

(aHer  a  full  expansion,  we  have  only  one  h.f.)  

 

 

Page 62: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  dele7on?  

§  reverse  of  inser5on:  

Prakash  2014   VT  CS  5614   62  

Page 63: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  dele7on?  

Prakash  2014   VT  CS  5614   63  

§  reverse  of  inser5on:  §  if  the  underflow  criterion  is  met  

– contract!  

Page 64: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  how  to  contract?  

Prakash  2014   VT  CS  5614   64  

h0(x)  =  mod  N                      (for  the  un-­‐split  buckets)  h1(x)  =  mod  (2*N)          (for  the  spliEed  ones)  

split  ptr  

0   1   2   3   4   5   6  

Page 65: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Linear  hashing  -­‐  how  to  contract?  

Prakash  2014   VT  CS  5614   65  

h0(x)  =  mod  N                      (for  the  un-­‐split  buckets)  h1(x)  =  mod  (2*N)          (for  the  spliEed  ones)  

split  ptr  

0   1   2   3   4   5  

   

Page 66: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  -­‐  pros?  

Prakash  2014   VT  CS  5614   66  

Page 67: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Hashing  -­‐  pros?  

§  Speed,    – on  exact  match  queries  – on  the  average  

Prakash  2014   VT  CS  5614   67  

Page 68: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

B(+)-­‐trees  -­‐  pros?  

Prakash  2014   VT  CS  5614   68  

Page 69: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

B(+)-­‐trees  -­‐  pros?  

§  Speed  on  search:  – exact  match  queries,  worst  case  –  range  queries  – nearest-­‐neighbor  queries  

§  Speed  on  inser5on  +  dele5on  §  smooth  growing  and  shrinking  (no  re-­‐org)  

Prakash  2014   VT  CS  5614   69  

Page 70: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Conclusions  

§  B-­‐trees  and  variants:  in  all  DBMSs  §  hash  indices:  in  some  

–  (but  hashing  in  useful  for  joins…)  

Prakash  2014   VT  CS  5614   70  

Page 71: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

SORTING  

Prakash  2014   VT  CS  5614   71  

Page 72: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   72  

Why  Sort?  

Page 73: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   73  

Why  Sort?  

§  select  ...  order by    –  e.g.,  find  students  in  increasing  gpa  order  

§ bulk  loading  B+  tree  index.  § duplicate  eliminaLon  (select distinct)  §  select ... group by  §  Sort-­‐merge  join  algorithm  involves  sor5ng.  

Page 74: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   74  

Outline  

§  two-­‐way  merge  sort  §  external  merge  sort  §  fine-­‐tunings  §  B+  trees  for  sor5ng  

Page 75: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   75  

2-­‐Way  Sort:  Requires  3  Buffers  §  Pass  0:  Read  a  page,  sort  it,  write  it.  

–  only  one  buffer  page  is  used  

§  Pass  1,  2,  3,  …,  etc.:  requires  3  buffer  pages  – merge  pairs  of  runs  into  runs  twice  as  long  –  three  buffer  pages  used.  

Main memory buffers

INPUT 1

INPUT 2

OUTPUT

Disk Disk

Page 76: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   76  

Two-­‐Way  External  Merge  Sort  

§  Each  pass  we  read  +  write  each  page  in  file.  

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,6 2,6 4,9 7,8 1,3 2

2,3 4,6

4,7 8,9

1,3 5,6 2

2,3 4,4 6,7 8,9

1,2 3,5 6

1,2 2,3 3,4 4,5 6,6 7,8

Page 77: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   77  

Two-­‐Way  External  Merge  Sort  

§  Each  pass  we  read  +  write  each  page  in  file.  

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,6 2,6 4,9 7,8 1,3 2

2,3 4,6

4,7 8,9

1,3 5,6 2

2,3 4,4 6,7 8,9

1,2 3,5 6

1,2 2,3 3,4 4,5 6,6 7,8

Page 78: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   78  

Two-­‐Way  External  Merge  Sort  

§  Each  pass  we  read  +  write  each  page  in  file.  

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,6 2,6 4,9 7,8 1,3 2

2,3 4,6

4,7 8,9

1,3 5,6 2

2,3 4,4 6,7 8,9

1,2 3,5 6

1,2 2,3 3,4 4,5 6,6 7,8

Page 79: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   79  

Two-­‐Way  External  Merge  Sort  

§  Each  pass  we  read  +  write  each  page  in  file.  

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,6 2,6 4,9 7,8 1,3 2

2,3 4,6

4,7 8,9

1,3 5,6 2

2,3 4,4 6,7 8,9

1,2 3,5 6

1,2 2,3 3,4 4,5 6,6 7,8

Page 80: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   80  

Two-­‐Way  External  Merge  Sort  

§  Each  pass  we  read  +  write  each  page  in  file.  

§  N  pages  in  the  file  =>    §  So  total  cost  is:        §  Idea:    Divide  and  conquer:  

sort  subfiles  and  merge  

⎡ ⎤= +log2 1N

⎡ ⎤( )2 12N Nlog +

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,6 2,6 4,9 7,8 1,3 2

2,3 4,6

4,7 8,9

1,3 5,6 2

2,3 4,4 6,7 8,9

1,2 3,5 6

1,2 2,3 3,4 4,5 6,6 7,8

Page 81: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   81  

External  merge  sort  

B  >  3  buffers  §  Q1:  how  to  sort?  §  Q2:  cost?  

Page 82: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   82  

General  External  Merge  Sort  

B Main memory buffers Disk Disk

.  .  .  .  .  .  

 B>3  buffer  pages.    How  to  sort  a  file  with  N  pages?  

.  .  .  

Page 83: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   83  

General  External  Merge  Sort  

–  Pass  0:  use  B  buffer  pages.  Produce                            sorted  runs  of  B  pages  each.    

–  Pass  1,  2,  …,    etc.:  merge  B-­‐1  runs.    

⎡ ⎤N B/

B Main memory buffers

INPUT 1

INPUT B-1

OUTPUT

Disk Disk

INPUT 2

.  .  .   .  .  .  .  .  .  

Page 84: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   84  

Sor7ng  

– create  sorted  runs  of  size  B  (how  many?)  – merge    them  (how?)  

B

... ...

Page 85: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   85  

Sor7ng  

–  create  sorted  runs  of  size  B  –  merge  first  B-­‐1  runs  into  a  sorted  run  of          (B-­‐1)  *B,  ...  

B

... ... …..

Page 86: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   86  

Sor7ng  

–  How  many  steps  we  need  to  do?            ‘i’,  where      B*(B-­‐1)^i  >  N    –  How  many  reads/writes  per  step?  N+N  

B

... ... …..

Page 87: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   87  

Cost  of  External  Merge  Sort  

§  Number  of  passes:  §  Cost  =  2N  *  (#  of  passes)  

⎡ ⎤⎡ ⎤1 1+ −log /B N B

Page 88: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   88  

Cost  of  External  Merge  Sort  

§  E.g.,  with  5  buffer  pages,  to  sort  108  page  file:  –  Pass  0:                                      =  22  sorted  runs  of  5  pages  each  (last  run  is  only  3  pages)    

–  Pass  1:                                  =  6  sorted  runs  of  20  pages  each  (last  run  is  only  8  pages)  

–  Pass  2:    2  sorted  runs,  80  pages  and  28  pages  –  Pass  3:    Sorted  file  of  108  pages  

⎡ ⎤108 5/

⎡ ⎤22 4/

Formula  check:  ┌log4  22┐=  3  …  +  1  à  4  passes    √  

Page 89: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   89  

Number  of  Passes  of  External  Sort  

N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4

(  I/O  cost  is  2N  5mes  number  of  passes)  

Page 90: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   90  

§  Quicksort  is  a  fast  way  to  sort  in  memory.  

Internal  Sort  Algorithm  

Page 91: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   91  

Blocked  I/O  &  double-­‐buffering  

§  So  far,  we  assumed  random  disk  access  §  Cost  changes,  if  we  consider  that  runs  are  wriyen  (and  read)  sequen5ally  

§ What  could  we  do  to  exploit  it?  

Page 92: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   92  

Blocked  I/O  &  double-­‐buffering  

§  So  far,  we  assumed  random  disk  access  §  Cost  changes,  if  we  consider  that  runs  are  wriyen  (and  read)  sequen5ally  

§ What  could  we  do  to  exploit  it?  §  A1:  Blocked  I/O  (exchange  a  few  r.d.a  for  several  sequen5al  ones)  

§  A2:  double-­‐buffering  

Page 93: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Double  Buffering  §  To  reduce  wait  5me  for  I/O  request  to  complete,  can  prefetch  into  `shadow  block’.    –  Poten5ally,  more  passes;  in  prac5ce,  most  files  s5ll  sorted  in  2-­‐3  passes.  

OUTPUT

OUTPUT'

Disk Disk

INPUT 1

INPUT k

INPUT 2

INPUT 1'

INPUT 2'

INPUT k'

block size b

Prakash  2014   VT  CS  5614   93  

Page 94: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   94  

Using  B+  Trees  for  Sor7ng  

§  Scenario:  Table  to  be  sorted  has  B+  tree  index  on  sor5ng  column(s).  

§  Idea:  Can  retrieve  records  in  order  by  traversing  leaf  pages.  

§  Is  this  a  good  idea?  §  Cases  to  consider:  

–  B+  tree  is  clustered      –  B+  tree  is  not  clustered    

Page 95: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   95  

Using  B+  Trees  for  Sor7ng  

§  Scenario:  Table  to  be  sorted  has  B+  tree  index  on  sor5ng  column(s).  

§  Idea:  Can  retrieve  records  in  order  by  traversing  leaf  pages.  

§  Is  this  a  good  idea?  §  Cases  to  consider:  

–  B+  tree  is  clustered    Good  idea!  –  B+  tree  is  not  clustered  Could  be  a  very  bad  idea!  

Page 96: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   96  

Clustered  B+  Tree  Used  for  Sor7ng  

 §  Cost:  root  to  the  le]-­‐most  leaf,  then  retrieve  all  leaf  pages  (Alterna5ve  1)  

 

   Always  beyer  than  external  sor5ng!  

(Directs search)

Data Records

Index

Data Entries ("Sequence set")

Page 97: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   97  

Unclustered  B+  Tree  Used  for  Sor7ng  

§  Alterna5ve  (2)  for  data  entries;  each  data  entry  contains  rid  of  a  data  record.    In  general,  one  I/O  per  data  record!  

(Directs search)

Data Records

Index

Data Entries ("Sequence set")

Page 98: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   98  

External  Sor7ng  vs.  Unclustered  Index  

 p:  #  of  records  per  page    B=1,000  and  block  size=32  for  sor-ng    p=100  is  the  more  realis-c  value.  

N Sorting p=1 p=10 p=100 100 200 100 1,000 10,000 1,000 2,000 1,000 10,000 100,000 10,000 40,000 10,000 100,000 1,000,000 100,000 600,000 100,000 1,000,000 10,000,000 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000

Page 99: CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

Prakash  2014   VT  CS  5614   99  

Summary  

§  External  sor5ng  is  important  §  External  merge  sort  minimizes  disk  I/O  cost:  

–  Pass  0:  Produces  sorted  runs  of  size  B  (#  buffer  pages).    

–  Later  passes:  merge  runs.  

§  Clustered  B+  tree  is  good  for  sor5ng;  unclustered  tree  is  usually  very  bad.