28
CPSC 404, Laks V.S. Lakshmanan External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR Chapter 2 (Sec. 2.4-2.5): Garcia-et Molina al. (R1) OTE: sorting using B-trees – to be assigned for rea fter we cover B-trees.

CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

Embed Size (px)

Citation preview

Page 1: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 1

External Sorting

Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and

Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2)

ORChapter 2 (Sec. 2.4-2.5): Garcia-et

Molina al. (R1)

NOTE: sorting using B-trees – to be assigned for reading after we cover B-trees.

Page 2: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 2

What you will learn from this set of lectures

How to efficiently sort a file so large it won’t all fit in memory?

What knobs are available for me to tune the performance of such a sort algorithm & how do I tune them?

Page 3: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 3

Motivation

A classic problem in computer science! Data requested in sorted order

– e.g., arrange movie records in increase order of ratings. Why is it important?

– is first step in bulk loading (i.e., creating) B+ tree index.– useful also for eliminating duplicate copies in a collection of

records & for aggregation (Why?) [can you think of any other strategy for these tasks?]

– (later) for efficient joins of relations, i.e., Sort-merge algo. Problem: sort 1GB of data with 50MB of RAM.

Page 4: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 4

I/O Computation Model

disk I/O (read/write a block) very expensive compared to processing it once it’s in memory.

random block accesses very common. reasonable model for computation using

secondary storage count only disk I/O’s.

Page 5: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 5

Desiderata for good DBMS algorithms

overlap I/O and cpu processing as much as possible.

use as much data from each block read as possible; depends on record clustering.

cluster records that are accessed together in consecutive blocks/pages.

buffer frequently accessed blocks in memory.

Page 6: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 6

Merge Sort Overview Idea: given B buffer pages of main memory (Sort phase)

– read in B pages of data each time and sort internally

– suppose entire data stored in m * B pages– sort phase produces m sorted runs

(sublists) of B pages each

Page 7: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 7

Merge Sort Overview

(Merge phase) – repeatedly merge two sorted runs

(Pass 0: the sort phase) Pass 1 produces m/2 sorted runs of 2B pages each Pass 2 produces m/4 sorted runs of 4B pages each continue until one sorted run of m * B pages produced

2-way merge can be optimized to k-way merge, where k can be as large as B-1 (see Ch. 13.2 and (R1 OR R2) chapters).

Page 8: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 8

Sorting Example

Setup: 10M records of 100 bytes = 1GB file.

– Stored on Megatron 747 disk, with 4KB blocks, each holding 40 records + header information.

– Suppose each cylinder = 1MB. entire file takes up 1000 cylinders 250 x 1000 blocks.

50M available main memory = 50 x 2^20/(4 x 2^10) = 12,800 blocks = 1/20th of file.

Sort by primary key field. => ==>

Page 9: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 9

Merge Sort

Common main memory sorting algorithms don't optimize disk I/O's. Variants of Merge Sort do better.

Merge = take two sorted lists and repeatedly choose the smaller of the ``heads'' of the lists (head = first among the unchosen).

Example: merge 1,3,4,8 with 2,5,7,9 = 1,2,3,4,5,7,8,9.

Merge Sort based on recursive algorithm: divide list of records into two parts; recursively mergesort the parts, and merge the resulting lists.

Page 10: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 10

Two Phase, Multiway Merge Sort “Vanilla” Merge Sort still not very good in disk

I/O model. log2 n passes, so each record is read/written

from disk log2 n times. 2PMMS: 2 reads + 2 writes per block. Phase 1 1. Fill buffer with records. 2. Sort using favorite main memory sort. 3. Write sorted sublist to disk. 4. Repeat until all records have been put into

one of the sorted sublists. Sorted sublist = SSL = sorted run.

Page 11: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 11

Two Phase, Multiway Merge Sort

Page 12: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 12

Two Phase, Multiway Merge Sort

Page 13: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 13

Two Phase, Multiway Merge Sort

Page 14: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 14

Two Phase, Multiway Merge Sort

Page 15: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 15

2PMMS (contd.) Phase 2 Use one buffer for each of the sorted

sublists and one buffer for output. – i.e., split available buffer into several parts

(logical buffers) and allocate for various purposes.

Available Memory…

Page 16: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 16

2PMMS (contd.) Phase 2 Use one buffer for each of the sorted

sublists and one buffer for output.

Pointersto first unchosen

Select the smallest unchosen

Output buffer

Page 17: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 17

2PMMS (contd.) Phase 2 Use one buffer for each of the sorted

sublists and one buffer for an output block.

Pointersto first unchosen

Select the smallest unchosen

Output buffer

6

5

8

5

Page 18: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 18

2PMMS (contd.)

Initially load input buffers with the first blocks of their respective sorted lists.

Repeatedly run a competition among the first unchosen records of each of the buffered blocks. – Move the record with the least key to the output block;

it is now ``chosen.'' Manage the buffers as needed:

– If an input block is exhausted, get the next block from the same list (i.e., file).

– If the output block is full, write it to disk. (assumes each buffer holds just one block. In

practice,you design for each buffer to hold as many blocks as possible.)

Page 19: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 19

Analysis of Naïve Implementation

Recall problem parameters. Assume blocks are stored at random, so

average access time includes seek + rotational delay + transfer, and is about 15 ms.

File stored on 250,000 blocks, read and written once in each phase.

1,000,000 disk I/O's * 0.015 seconds = 15,000 seconds = 250 minutes = 4+ hours.

Problem and Interlude:How many records can you sort with 2PMMS, under our assumptions about records, main memory, and the disk? What would you do if there were more?

Page 20: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 20

Improving the Running Time of 2PMMS

Here are some techniques that sometimes make secondary memory algorithms more efficient: 1. Group blocks by cylinder (“cylindrification”). 2. One big disk several smaller disks. 3. Mirror disks = multiple copies of the same data. 4. ``Prefetching'' or ``double buffering.''5. Disk scheduling; the ``elevator'' algorithm.

Page 21: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 21

CylindrificationIf we are going to read or write blocks in a known order, place them by cylinder, so once we reach that cylinder we can read block after block, with no seek time or rotational latency.

Application to Phase 1 of 2PMMS: Recall problem parameters. 1. Initially, records on 1000 consecutive cylinders. 2. Load main memory from 50 consecutive cylinders.

Order of blocks read is unimportant, so only time besides transfer time is one random seek

and 49 1 cylinder seeks (neglect). Time to transfer 12,800 blocks at 0.5 ms/block = 6.4 sec.

3. Write each sorted sublist onto 50 consecutive cylinders, so write time is also about 6.4 sec. What about for the whole file?

But in Phase 2: Blocks are read from the 20 sublists in a data dependent order, so we cannot take much advantage of cylinders. (why 20 sublists? ‘cos there is 1000 cylinders worth of data) .

Sin

gle

sort

ed s

ub

list

p

ersp

ecti

ve

Page 22: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 22

Prefetching If we have extra space for main memory buffers, consider loading buffers in advance of need. Similarly, keep output blocks buffered in main memory until it is convenient to write them to disk. Example: Phase 2 of 2PMMS With 50MB of main memory, we can afford to buffer two cylinders for each sublist and for the output (recall: one cylinder = 1MB).

Consume one cylinder for each sublist while the other is being loaded from disk.

Similarly, write one output cylinder while the other is being constructed (via merge).

Page 23: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 23

Prefetching

Thus, seek and rotational latency are made negligible for Phase I (thanks to cylindrification) and are minimized for phase II (thanks to prefetching).

So, the total time for phase 2 is approx. one read plus one write of whole file, but one cylinderful at a time.

Page 24: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 24

Prefetching/Double buffering Illustrated

•Alternately process different halves of each input buffer while the other part is being filled from disk.•Similar idea for output buffer.

i/p buffer

o/p buffer

Page 25: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 25

Using B+ Trees for Sorting

Scenario: Table to be sorted has B+ tree index on sorting column(s).

Idea: Can retrieve records in order by traversing leaf pages.

Is this a good idea? Cases to consider:

– B+ tree is clustered Good idea!– B+ tree is not clusteredCould be a very bad

idea!

Page 26: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 26

Clustered B+ Tree Used for Sorting

Cost: root to the left-most leaf, then retrieve all leaf pages (Alternative 1)

If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once.

(Direct search)

Data Records

Index

Data Entries("Sequence set")

Page 27: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 27

Unclustered B+ Tree Used for Sorting

Alternative (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record!

(Directs search)

Data Records

Index

Data Entries("Sequence set")

Page 28: CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec. 13-1-13.5): Ramakrishnan & Gehrke and Chapter 11 (Sec. 11.4-11.5): G-M et al. (R2) OR

CPSC 404, Laks V.S. Lakshmanan 28

Conclusion: Sorting Records!

Sorting has become a blood sport!– Parallel sorting is the name of the game ...

Datamation: Sort 1M records of size 100 bytes– Typical DBMS: 15 minutes– World record: 3.5 seconds – 12-CPU SGI machine, 96 disks, 2GB of RAM

New benchmarks proposed:– Minute Sort: How many can you sort in 1 minute?– Dollar Sort: How many can you sort for $1.00?