
Chapter11-01.ppt - University of Iowa (homepage.cs.uiowa.edu/~hzhang/c31/notes/Chapter11-01.pdf)


3/12/2015


Chapter 11 Balanced Trees

Why care about advanced implementations?

Same entries, different insertion sequence:

Not good! Would like to keep tree balanced.

Balanced binary tree

• The disadvantage of a binary search tree is that its height can be as large as N‐1

• This means that the time needed to perform insertion and deletion and many other operations can be O(N) in the worst case

• We want a tree with small height

• A binary tree with N nodes has height Ω(log N)

• Thus, our goal is to keep the height of a binary search tree O(log N)

• Such trees are called balanced binary search trees.  Examples are AVL tree, red‐black tree.

AVL Trees

• A sorted binary tree

• The heights of two subtrees at any given node differ by at most 1

New Height Definition

For convenience, we redefine Height of a node

• The height of a leaf is 1.  The height of a null node is zero.

• The height of an internal node is the maximum height of its children plus 1

Note that this definition of height is different from the one we defined previously (we defined the height of a leaf as zero previously).

AVL Tree 

Class BinaryNode

KeyType: Key

integer: Height

BinaryNode: LeftChild

BinaryNode: RightChild

BinaryNode: parent // optional

Constructor(KeyType: key)

Key = key

Height = 1

End Constructor

End Class


AVL tree

• An AVL tree is a binary search tree in which

– for every node in the tree, the height of the left and right subtrees differ by at most 1.

AVL property violated here

The left tree is AVL; the right tree is not AVL.
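The AVL property can be checked mechanically. The following is a minimal sketch, assuming a tree is written as nested tuples `(left, key, right)` with `None` for an empty subtree (this representation and the name `avl_height` are illustrative, not from the slides). It uses the convention that an empty tree has height 0 and a leaf has height 1, and it checks only the height condition, not the search-tree ordering.

```python
def avl_height(tree):
    """Return the height of `tree` if every node satisfies the AVL
    height condition, or -1 if some node violates it.

    A tree is None (empty, height 0) or a tuple (left, key, right).
    """
    if tree is None:
        return 0
    left, _key, right = tree
    hl = avl_height(left)
    hr = avl_height(right)
    # Propagate a violation found below, or report one at this node.
    if hl < 0 or hr < 0 or abs(hl - hr) > 1:
        return -1
    return 1 + max(hl, hr)


balanced = ((None, 1, None), 2, (None, 3, None))
chain = (((None, 1, None), 2, None), 3, None)
print(avl_height(balanced))  # height of a valid AVL tree
print(avl_height(chain))     # -1: subtree heights at the root differ by 2
```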

AVL tree

• Let x be the root of an AVL tree of height h

• Let $N_h$ denote the minimum number of nodes in an AVL tree of height h

• We have

$$N_h \ge N_{h-1} + N_{h-2} + 1 \ge 2N_{h-2} + 1 > 2N_{h-2}$$

• By repeated substitution, we obtain the general form

$$N_h > 2^i N_{h-2i}$$

• The boundary conditions are: $N_1 = 1$ and $N_2 = 2$. This implies that $h = O(\log N_h)$.

• Thus, many operations (searching, insertion, deletion) on an AVL tree will take O(log N) time.
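As a numerical check of the bound, here is a small sketch that evaluates the recurrence with equality (a sparsest AVL tree of height h is a root plus sparsest subtrees of heights h-1 and h-2) and compares h with $2\log_2 N_h + 1$. The function name `min_nodes` is illustrative.

```python
import math


def min_nodes(h):
    """Minimum number of nodes in an AVL tree of height h (leaf height = 1)."""
    a, b = 1, 2  # N_1 and N_2, the boundary conditions
    if h == 1:
        return a
    for _ in range(h - 2):
        a, b = b, a + b + 1  # N_h = N_{h-1} + N_{h-2} + 1
    return b


for h in range(2, 11):
    n = min_nodes(h)
    # The height stays below 2*log2(N_h) + 1, i.e. h = O(log N_h).
    print(h, n, round(2 * math.log2(n) + 1, 2))
```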

Insertion

• First, insert the new key as a new leaf just as in an ordinary binary search tree

• Then trace the path from the new leaf towards the root.  For each node x encountered, check if heights of x.LeftChild and x.RightChild differ by at most 1.

• If yes, proceed to parent(x).  If not, restructure by doing either a single rotation or a double rotation.

• For insertion, once we perform a rotation at a node x, we won’t need to perform any rotation at any ancestor of x.

Rotations

• Since an insertion/deletion involves adding/deleting a single node, this can only increase/decrease the height of some subtree by 1

• Thus, if the AVL tree property is violated at a node x, it means that the heights of left(x) and right(x) differ by exactly 2.

• Rotations will be applied to x to restore the AVL tree property.

Rotations

• When the tree structure changes (e.g., insertion or deletion), we need to transform the tree to restore the AVL tree property.

• This is done using single rotations or double rotations.

[Figure: e.g. Single Rotation. Nodes x and y with subtrees A, B, C, shown before and after the rotation.]

Insertion & Rotation 

• Let x be the node at which left(x) and right(x) differ by 2

• Assume that the height of x is h+3

• There are 4 cases

– Height of left(x) is h+2 (i.e. height of right(x) is h)

• Height of left(left(x)) is h+1 → single rotate with left child

• Height of right(left(x)) is h+1 → double rotate with left child

– Height of right(x) is h+2 (i.e. height of left(x) is h)

• Height of right(right(x)) is h+1 → single rotate with right child

• Height of left(right(x)) is h+1 → double rotate with right child


Single Right Rotation

The new key is inserted in the subtree A. The AVL-property is violated at x: height of left(x) is h+2, height of right(x) is h.

Single Left Rotation

The new key is inserted in the subtree C. The AVL-property is violated at x.

keys(A) < x < keys(B) < y < keys(C)

Single rotation takes O(1) time. Insertion takes O(log N) time.

Example: Insert 0.8

[Figure: AVL tree with root 5, children 3 and 8, and 3 having children 1 and 4. Inserting 0.8 below 1 violates the AVL property at x = 5, with y = 3. After rotation: root 3 with children 1 and 5; 1 has child 0.8; 5 has children 4 and 8.]

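rightRotate( BinaryNode: x ){

BinaryNode y, z;

y = x.LeftChild;
x.LeftChild = y.RightChild;
if ( y.RightChild != Nil ) (y.RightChild).parent = x; // the moved subtree may be empty
y.RightChild = x;
z = y.parent = x.parent;
x.parent = y;

if ( z == Nil ) Root = y;

else if ( x == z.LeftChild ) z.LeftChild = y;

else z.RightChild = y;

// the Height fields of x and then y must be recomputed after the rotation

}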

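leftRotate( BinaryNode: x ){

BinaryNode y, z;

y = x.RightChild;
x.RightChild = y.LeftChild;
if ( y.LeftChild != Nil ) (y.LeftChild).parent = x; // the moved subtree may be empty
y.LeftChild = x;
z = y.parent = x.parent;
x.parent = y;

if ( z == Nil ) Root = y;

else if ( x == z.LeftChild ) z.LeftChild = y;

else z.RightChild = y;

// the Height fields of x and then y must be recomputed after the rotation

}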

Double rotation

The new key is inserted in the subtree B1 or B2. The AVL-property is violated at x. x-y-z forms a zig-zag shape.

also called left-right rotate

keys(A) < y < keys(B1) < z < keys(B2) < x < keys(C)


Double rotation

The new key is inserted in the subtree B1 or B2. The AVL-property is violated at x.

also called right-left rotate

keys(A) < x < keys(B1) < z < keys(B2) < y < keys(C)

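Example: Insert 3.5

[Figure: AVL tree with root 5, children 3 and 8, and 3 having children 1 and 4. Inserting 3.5 below 4 violates the AVL property at x = 5, with y = 3 and z = 4. After the double rotation: root 4 with children 3 and 5; 3 has children 1 and 3.5; 5 has right child 8.]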

leftRightRotate( BinaryNode: x ){
leftRotate(x.LeftChild);
rightRotate(x);
}

rightLeftRotate( BinaryNode: x ){
rightRotate(x.RightChild);
leftRotate(x);
}

An Extended Example

Insert 3,2,1,4,5,6,7, 16,15,14

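• Fig 1 to Fig 3: insert 3, 2, 1. The tree is a left chain 3, 2, 1 and is unbalanced at 3.

• Fig 4: single rotation. Root 2 with children 1 and 3.

• Fig 5 and Fig 6: insert 4 and 5. The tree is unbalanced at 3.

• Fig 7: single rotation. Root 2 with children 1 and 4; 4 has children 3 and 5.

• Fig 8: insert 6 below 5. The tree is unbalanced at the root 2.

• Fig 9: single rotation. Root 4 with children 2 and 5; 2 has children 1 and 3; 5 has right child 6.

• Fig 10: insert 7 below 6. The tree is unbalanced at 5.

• Fig 11: single rotation. Root 4 with children 2 and 6; 2 has children 1 and 3; 6 has children 5 and 7.

• Fig 12 and Fig 13: insert 16 below 7, then 15 below 16. The tree is unbalanced at 7.

• Fig 14: double rotation. Root 4 with children 2 and 6; 6 has children 5 and 15; 15 has children 7 and 16.

• Fig 15: insert 14 below 7. The tree is unbalanced at 6.

• Fig 16: double rotation. Root 4 with children 2 and 7; 7 has children 6 and 15; 6 has left child 5; 15 has children 14 and 16.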
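The insertion procedure and the four rotation cases can be sketched in Python as follows. This is a minimal recursive version under stated assumptions: it has no parent pointers (unlike the pseudocode above), each function returns the new subtree root instead, and the names `AVLNode`, `rebalance` and `insert` are illustrative. It replays the extended example.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.height = 1  # a leaf has height 1, an empty subtree 0
        self.left = None
        self.right = None


def height(n):
    return n.height if n else 0


def update(n):
    n.height = 1 + max(height(n.left), height(n.right))


def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x  # y's right subtree moves under x
    update(x)
    update(y)
    return y


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x  # y's left subtree moves under x
    update(x)
    update(y)
    return y


def rebalance(x):
    update(x)
    diff = height(x.left) - height(x.right)
    if diff == 2:  # left subtree too tall
        if height(x.left.left) < height(x.left.right):
            x.left = rotate_left(x.left)  # zig-zag: double rotation
        return rotate_right(x)
    if diff == -2:  # right subtree too tall
        if height(x.right.right) < height(x.right.left):
            x.right = rotate_right(x.right)  # zig-zag: double rotation
        return rotate_left(x)
    return x


def insert(root, key):
    if root is None:
        return AVLNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return rebalance(root)  # checks each node on the way back to the root


def inorder(n):
    return [] if n is None else inorder(n.left) + [n.key] + inorder(n.right)


root = None
for k in [3, 2, 1, 4, 5, 6, 7, 16, 15, 14]:
    root = insert(root, k)
print(root.key, root.left.key, root.right.key)
print(inorder(root))
```

Rebalancing on the way back up the recursion plays the role of tracing the path from the new leaf towards the root.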

Deleting Nodes

• Use the same rotations

Deletion 

• Delete a node x as in ordinary binary search tree.  Note that the last node deleted is a node with 0 or 1 child.

• Then trace the path from that node towards the root.

• For each node x encountered, check if heights of x.LeftChild and x.RightChild differ by at most 1.  If yes, proceed to x.parent.  If not, perform an appropriate rotation at x.  There are 4 cases as in the case of insertion.

• For deletion, after we perform a rotation at x, we may have to perform a rotation at some ancestor of x. Thus, we must continue to trace the path until we reach the root. 

Deletion 

• On closer examination: the single rotations for deletion can be divided into 4 cases (instead of 2 cases)

– Two cases for rotate with left child

– Two cases for rotate with right child

Single rotations in deletion

rotate with left child

In both figures, a node is deleted in subtree C, causing the height to drop to h. The height of y is h+2. When the height of subtree A is h+1, the height of B can be h or h+1. Fortunately, the same single rotation can correct both cases.

Single rotations in deletion

rotate with right child

In both figures, a node is deleted in subtree A, causing the height to drop to h. The height of y is h+2. When the height of subtree C is h+1, the height of B can be h or h+1. A single rotation can correct both cases.


Rotations in deletion

• There are 4 cases for single rotations, but we do not need to distinguish among them.

• There are exactly two cases for double rotations (as in the case of insertion)

• Therefore, we can reuse exactly the same procedure for insertion to determine which rotation to perform

2‐3 Trees

• A node contains one or two key values (called 2‐nodes or 3‐nodes, respectively)

• Every internal 2‐node has two children

• Every internal 3‐node has three children

• All leaves are at the same level.

2‐3 Trees

Features

• each internal node has either 2 or 3 children

• all leaves are at the same level

Example of 2‐3 Tree

2‐3 Trees with Ordered Nodes

2-node 3-node

• leaf node can be either a 2-node or a 3-node

Traversing a 2‐3 Tree

inorder(TwoThreeTree: ttt)

if (ttt is a leaf)
visit the data item(s) of ttt

else if (ttt has two data items){
inorder(left subtree of ttt)
visit the first data item
inorder(middle subtree of ttt)
visit the second data item
inorder(right subtree of ttt)
}
else{
inorder(left subtree of ttt)
visit the data item
inorder(right subtree of ttt)
}


Searching a 2‐3 Tree

TreeItemType: retrieveItem(TwoThreeTree: ttt, KeyType: key)

if (key is inside ttt’s node)
return the data portion of key

else if (ttt is a leaf)
return NIL

else {
let sttt be the appropriate subtree
return retrieveItem(sttt, key)
}
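The traversal and search routines above can be sketched in Python as follows. This is a minimal sketch, assuming a node stores its one or two items in a sorted list and its children in a list (empty for a leaf); the class name `TwoThreeNode`, the `visit` callback and the sample tree are illustrative, and the items double as their own keys.

```python
class TwoThreeNode:
    def __init__(self, items, children=()):
        self.items = list(items)        # one or two items, in ascending order
        self.children = list(children)  # empty for a leaf, else len(items) + 1


def inorder(node, visit):
    if node is None:
        return
    if not node.children:  # leaf: visit its item(s)
        for item in node.items:
            visit(item)
        return
    for i, item in enumerate(node.items):
        inorder(node.children[i], visit)  # subtree to the left of item
        visit(item)
    inorder(node.children[-1], visit)     # rightmost subtree


def retrieve(node, key):
    while node is not None:
        if key in node.items:
            return key
        if not node.children:  # reached a leaf without finding the key
            return None
        # The appropriate subtree is indexed by the number of smaller items.
        node = node.children[sum(item < key for item in node.items)]
    return None


tree = TwoThreeNode([50, 90], [
    TwoThreeNode([20]),
    TwoThreeNode([70]),
    TwoThreeNode([120, 150]),
])
seen = []
inorder(tree, seen.append)
print(seen)
print(retrieve(tree, 70), retrieve(tree, 60))
```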

What did we gain?

What is the time efficiency of searching for an item?

Because every internal node has at least two children and all leaves are at the same level, a tree containing N nodes can have a height of at most log2(N+1), so searching takes O(log N) time.

Gain: Ease of Keeping the Tree Balanced

[Figure: a binary search tree and a 2-3 tree, both after inserting items 39, 38, ... 32.]

Inserting Items: Insert 39

Inserting Items: Insert 38

1) insert 38 in leaf

2) divide leaf and move middle value up to parent

3) result

Inserting Items: Insert 37

Inserting Items: Insert 36

• insert in leaf

• divide leaf and move middle value up to parent, which becomes an overcrowded node

Inserting Items: ... still inserting 36

• divide overcrowded node, move middle value up to parent, attach children to smallest and largest

• result

Inserting Items: After Insertion of 35, 34, 33

Inserting Items: How do we insert 32?

• creating a new root if necessary

• tree grows at the root

Inserting Items: Final Result
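The split-and-move-up steps can be sketched in Python. This is a minimal sketch, assuming distinct keys and a node that holds its keys and children in lists; the names `Node23`, `_insert` and `insert` are illustrative. A node that temporarily holds three keys is the overcrowded node, which is divided with its middle key moved up; a new root is created when the old root splits. It starts from an empty tree, so the result need not match the figures.

```python
class Node23:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted list of keys
        self.children = children or []  # empty list for a leaf


def _insert(node, key):
    """Insert key below node. Return (middle_key, new_right_node) if node
    had to be divided, otherwise None."""
    if not node.children:  # leaf: insert directly
        node.keys.append(key)
        node.keys.sort()
    else:
        i = sum(k < key for k in node.keys)  # index of the subtree to descend
        split = _insert(node.children[i], key)
        if split:
            mid, right = split
            node.keys.insert(i, mid)            # middle value moves up
            node.children.insert(i + 1, right)  # attach the new sibling
    if len(node.keys) == 3:  # overcrowded: divide the node
        right = Node23(node.keys[2:], node.children[2:])
        mid = node.keys[1]
        node.keys = node.keys[:1]
        node.children = node.children[:2]
        return mid, right
    return None


def insert(root, key):
    if root is None:
        return Node23([key])
    split = _insert(root, key)
    if split:  # the tree grows at the root
        mid, right = split
        return Node23([mid], [root, right])
    return root


def inorder(node):
    if not node.children:
        return list(node.keys)
    out = []
    for i, k in enumerate(node.keys):
        out += inorder(node.children[i]) + [k]
    return out + inorder(node.children[-1])


root = None
for k in range(39, 31, -1):  # 39, 38, ..., 32
    root = insert(root, k)
print(root.keys, [c.keys for c in root.children])
print(inorder(root))
```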

Deleting Items: Delete 70

• Deleting 70: swap 70 with inorder successor (80)

• Deleting 70: ... get rid of 70

• Result

Deleting Items: Delete 100

• Deleting 100

• Result

Deleting Items: Delete 80

• Deleting 80 ...

• Final Result (comparison with binary search tree)

Deletion Algorithm I

Deleting item I:

1. Locate node n, which contains item I

2. If node n is not a leaf, swap I with inorder successor (deletion always begins at a leaf)

3. If leaf node n contains another item, just delete item I; else try to redistribute nodes from siblings (see next slide); if not possible, merge node (see next slide)

Deletion Algorithm II

• Redistribution. A sibling has 2 items: redistribute item between siblings and parent

• Merging. No sibling has 2 items: merge node, move item from parent to sibling

Deletion Algorithm III

• Redistribution. Internal node n has no item left: redistribute

• Merging. Redistribution not possible: merge node, move item from parent to sibling, adopt child of n

If n's parent ends up without item, apply process recursively

Deletion Algorithm IV

If merging process reaches the root and root is without item: delete root


Operations of 2‐3 Trees

all operations have time complexity of O(log n)

2‐3‐4 Trees

• similar to 2-3 trees

• 4-nodes can have 3 items and 4 children

[Figure: a 4-node.]

2‐3‐4 Tree Example

2‐3‐4 Tree: Insertion

Insertion procedure:

• similar to insertion in 2-3 trees

• items are inserted at the leaves

• since a 4-node cannot take another item, 4-nodes are split up during insertion process

Strategy

• on the way from the root down to the leaf: split up all 4-nodes "on the way"

• insertion can be done in one pass (remember: in 2-3 trees, a reverse pass might be necessary)

2‐3‐4 Tree: Insertion

Inserting 60, 30, 10, 20, 50, 40, 70, 80, 15, 90, 100

[Figures: the tree after inserting 60, 30, 10, 20; then 50, 40; then 70; then 80, 15; then 90; then 100.]

2‐3‐4 Tree: Insertion Procedure

• Splitting 4-nodes during insertion

• Splitting a 4-node whose parent is a 2-node during insertion

• Splitting a 4-node whose parent is a 3-node during insertion
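The one-pass strategy (split every 4-node met on the way down) can be sketched in Python. This is a minimal sketch, assuming distinct keys; the names `Node234`, `_split_child` and `insert` are illustrative. Because a 4-node is split before it is entered, its parent always has room for the middle key, so no reverse pass is needed.

```python
class Node234:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # 1 to 3 sorted keys
        self.children = children or []  # empty for a leaf


def _split_child(parent, i):
    """Split the 4-node parent.children[i]; its middle key moves up."""
    child = parent.children[i]
    right = Node234(child.keys[2:], child.children[2:])
    mid = child.keys[1]
    child.keys = child.keys[:1]
    child.children = child.children[:2]
    parent.keys.insert(i, mid)
    parent.children.insert(i + 1, right)


def insert(root, key):
    if root is None:
        return Node234([key])
    if len(root.keys) == 3:  # split a full root: the tree grows at the root
        root = Node234([], [root])
        _split_child(root, 0)
    node = root
    while node.children:
        i = sum(k < key for k in node.keys)
        if len(node.children[i].keys) == 3:  # 4-node on the way: split it now
            _split_child(node, i)
            if key > node.keys[i]:
                i += 1
        node = node.children[i]
    node.keys.append(key)  # the leaf reached is never a 4-node
    node.keys.sort()
    return root


def inorder(node):
    if not node.children:
        return list(node.keys)
    out = []
    for i, k in enumerate(node.keys):
        out += inorder(node.children[i]) + [k]
    return out + inorder(node.children[-1])


root = None
for k in [60, 30, 10, 20, 50, 40, 70, 80, 15, 90, 100]:
    root = insert(root, k)
print(root.keys, [c.keys for c in root.children])
print(inorder(root))
```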

2‐3‐4 Tree: Deletion

Deletion procedure:

• similar to deletion in 2-3 trees

• items are deleted at the leaves: swap item of internal node with inorder successor

• note: a 2-node leaf creates a problem

Strategy (different strategies possible)

• on the way from the root down to the leaf: turn 2-nodes (except root) into 3-nodes

• deletion can be done in one pass (remember: in 2-3 trees, a reverse pass might be necessary)

Red-Black Tree

• A red-black tree is a binary search tree such that each node has a color of either red or black.

• The root is black.

• Every path from a node to a leaf contains the same number of black nodes.

• If a node is red then its parent must be black.

Class BinaryNode
KeyType: Key
Boolean: isRed
BinaryNode: LeftChild
BinaryNode: RightChild
BinaryNode: parent

Constructor(KeyType: key)
Key = key
isRed = true
End Constructor
End Class

Example

The root is black.

The parent of any red node must be black.

Maintaining the Red Black Properties in a Tree

• Insertions

• Must maintain rules of Red Black Tree.

• New Node always a leaf

– can't be black or we will violate rule of the same # of blacks along any path

– therefore the new leaf must be red

– If parent is black, done (trivial case)

– if parent is red, things get interesting because a red leaf with a red parent violates the no-double-red rule.

Algorithm: Insertion

A red-black tree is a particular binary search tree, so create a new node as red and insert it as in a normal binary search tree.

What property may be violated?

The parent of a red node must be black.

[Figure: example insertion in which the new red node has a red parent, marked "Violation!".]

Algorithm: Insertion

We have detected a need for balance when z is red and its parent is red, too.

• If z has a red uncle: colour the parent and uncle black, and grandparent red. Then replace z by grandparent to see if new z’s parent is red.

• If z is a left child and has a black uncle: colour the parent black and the grandparent red, then rightRotate(z.parent.parent)

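rotateRight(G)

[Figure: before the rotation. G has children P and S; P has children X and C; X has children A and B; S has children D and E.]

Relative to G, X is at the left-left position. rotateRight(G) exchanges the roles of G and P, so P becomes G's parent. P and G must also be recolored.

After rotateRight(G)

[Figure: after the rotation. P has children X and G; X has children A and B; G has children C and S; S has children D and E.]

Apparent rule violation?

rotateLeft(G) will handle the case when X is at the right-right position relative to G.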

Algorithm: Insertion

We have detected a need for balance when z is red and its parent is red, too.

• If z has a red uncle: colour the parent and uncle black, and grandparent red. Then replace z by grandparent to see if z’s parent is red.

• If z is a left child and has a black uncle: colour the parent black and the grandparent red, then rotateRight(z.parent.parent)

• If z is a right child and has a black uncle, then rotateLeft(z.parent) and

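Double Rotation

• What if X is at left right relative to G?

– a single rotation will not work

• Must perform a double rotation

– rotate X and P

– rotate X and G

[Figure: before the double rotation. G has children P and S; P has children A and X; X has children B and C; S has children D and E.]

[Figure: after rotating X and P. G has children X and S; X has children P and C; P has children A and B; S has children D and E.]

After Double Rotation

[Figure: X has children P and G; P has children A and B; G has children C and S; S has children D and E.]

Double rotation is also needed when X is at right left position relative to G.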
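The recoloring and rotation cases can be sketched in Python as a bottom-up fix-up loop. This is a minimal sketch, not the slides' exact procedure: it handles the mirror-image cases symmetrically through a single `_rotate` helper, treats a missing (null) uncle as black, and the names `RBNode`, `RBTree`, `_fix` and `black_height` are illustrative.

```python
class RBNode:
    def __init__(self, key):
        self.key = key
        self.is_red = True  # a new node is always inserted red
        self.left = self.right = self.parent = None


class RBTree:
    def __init__(self):
        self.root = None

    def _rotate(self, x, left):
        """left=True rotates x with its right child, else with its left child."""
        y = x.right if left else x.left
        inner = y.left if left else y.right  # subtree that changes parent
        if left:
            x.right, y.left = inner, x
        else:
            x.left, y.right = inner, x
        if inner is not None:
            inner.parent = x
        y.parent = x.parent
        if x.parent is None:
            self.root = y
        elif x is x.parent.left:
            x.parent.left = y
        else:
            x.parent.right = y
        x.parent = y

    def insert(self, key):
        z, parent, cur = RBNode(key), None, self.root
        while cur is not None:  # ordinary binary search tree insertion
            parent, cur = cur, (cur.left if key < cur.key else cur.right)
        z.parent = parent
        if parent is None:
            self.root = z
        elif key < parent.key:
            parent.left = z
        else:
            parent.right = z
        self._fix(z)

    def _fix(self, z):
        while z.parent is not None and z.parent.is_red:
            p, g = z.parent, z.parent.parent  # g exists: a red p is not the root
            p_is_left = p is g.left
            uncle = g.right if p_is_left else g.left
            if uncle is not None and uncle.is_red:  # red uncle: recolor, move up
                p.is_red = uncle.is_red = False
                g.is_red = True
                z = g
                continue
            if p_is_left != (z is p.left):  # zig-zag: first rotate z and p
                self._rotate(p, left=p_is_left)
                z, p = p, z
            p.is_red, g.is_red = False, True  # recolor, then rotate at g
            self._rotate(g, left=not p_is_left)
        self.root.is_red = False  # the root is always black


def black_height(node):
    """Return the black height; raise AssertionError if a rule is violated."""
    if node is None:
        return 1
    for c in (node.left, node.right):
        assert not (node.is_red and c is not None and c.is_red), "double red"
    lh, rh = black_height(node.left), black_height(node.right)
    assert lh == rh, "paths with different numbers of black nodes"
    return lh + (0 if node.is_red else 1)


def inorder(n):
    return [] if n is None else inorder(n.left) + [n.key] + inorder(n.right)


t = RBTree()
for k in range(1, 11):
    t.insert(k)
print(inorder(t.root), black_height(t.root))
```

Because this version repairs the tree after the insertion rather than recoloring on the way down, the intermediate trees can differ from those of a top-down procedure, but the red-black rules hold after every insertion.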

Example of Inserting Sorted Numbers

• 1 2 3 4 5 6 7 8 9 10

Insert 1. A leaf so red. Realize it is root so recolor to black.

Insert 2. Make 2 red. Parent is black so done.

Insert 3. Parent is red. Parent's sibling is black (null). 3 is outside relative to grandparent. Rotate parent and grandparent. Result: root 2 with children 1 and 3.

Insert 4. On way down see 2 with 2 red children. Recolor 2 red and children black. Realize 2 is root so color back to black. When adding 4 parent is black so done.

Insert 5. 5's parent is red. Parent's sibling is black (null). 5 is outside relative to grandparent (3) so rotate parent and grandparent then recolor. Result: root 2 with children 1 and 4; 4 has children 3 and 5.

Insert 6. On way down see 4 with 2 red children. Make 4 red and children black. 4's parent is black so no problem. 6's parent is black so done.

Insert 7. 7's parent is red. Parent's sibling is black (null). 7 is outside relative to grandparent (5) so rotate parent and grandparent then recolor. Result: root 2 with children 1 and 4; 4 has children 3 and 6; 6 has children 5 and 7.

Insert 8. On way down see 6 with 2 red children. Make 6 red and children black. This creates a problem because 6's parent, 4, is also red. Must perform rotation. Recolored, now need to rotate. Result: root 4 with children 2 and 6; 2 has children 1 and 3; 6 has children 5 and 7; 7 has child 8.

Insert 9. On way down see 4 has two red children so recolor 4 red and children black. Realize 4 is the root so recolor black. After rotations and recoloring: root 4 with children 2 and 6; 6 has children 5 and 8; 8 has children 7 and 9.

Insert 10. On way down see 8 has two red children so change 8 to red and children black. 10 is added below 9.

Insert 11. Again a rotation is needed. Result: root 4 with children 2 and 6; 6 has children 5 and 8; 8 has children 7 and 10; 10 has children 9 and 11.

Properties of Red Black Trees

• If a Red node has any children, it must have two children and they must be Black. (Why?)

• If a Black node has only one child that child must be a Red leaf. (Why?)

• Due to the rules there are limits on how unbalanced a Red Black tree may become. 

Red‐Black Tree vs 2‐3‐4 Tree

• binary-search-tree representation of 2-3-4 tree

• 3- and 4-nodes are represented by equivalent binary trees

• Each 2-3-4 node generates exactly one black node (on the top), and zero red node for 2-nodes, one red for 3-nodes, and two red ones for 4-nodes.

Red‐Black Representation of 4‐node

Red‐Black Representation of 3‐node

Red‐Black Tree Example

Multiway Search Trees

A multiway search tree of order m, or an m‐way search tree, is an m‐ary tree in which:

1. Each node has up to m children and m‐1 keys

2. The keys in each node are in ascending order

3. The keys in the first i children are smaller than the ith key

4. The keys in the last m‐i children are larger than the ith key

A 5‐Way Search Tree

[Figure: example of a 5-way search tree.]

B‐tree

• B‐tree is a generalization of 2‐3‐4 tree with a large number of branches.

• A B‐tree of order m is an m‐way search tree (i.e., a tree where each node may have up to m children) in which:

1. the number of keys in each non‐leaf node is one less than the number of its children.

2. all leaves are on the same level

3. all non‐leaf nodes except the root have at least ⌈m/2⌉ children

4. the root is either a leaf node, or it has from two to m children

5. a leaf node contains no more than m – 1 keys

• The number m is assumed to be odd

An example B‐Tree

A B-tree of order 5 containing 26 items:

Root: (26)

Level 2: (6, 12) and (42, 51, 62)

Leaves: (1, 2, 4), (7, 8), (13, 15, 18, 25), (27, 29), (45, 46, 48), (53, 55, 60), (64, 70, 90)

Note that all the leaves are at the same level

A Typical Disk Drive


Disk Access

Disk Access Time = 

Seek Time (moving disk head to correct track)

+ Rotational Delay (rotating disk to correct block in track)

+ Transfer Time (time to transfer block of 

data to main memory)

Motivation for B‐Trees

• Index structures for large datasets cannot be stored in main memory

• Storing it on disk requires different approach to efficiency

• Assuming that a disk spins at 3600 RPM,  one revolution occurs in 1/60 of a second, or 16.7ms

• Crudely speaking, one disk access takes about the same time as 200,000 instructions

Motivation (cont.)

• Assume that we use an AVL tree to store about 20 million records

• We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds  

• We know we can’t improve on the log n lower bound on search for a binary tree

• But, the solution is to use more branches and thus reduce the height of the tree!

– As branching increases, depth decreases


A B‐Tree of Order 5

To find the location of a key, scan the keys at the root sequentially until reaching the pointer for which every key before it is less than the search key and every key after it is greater than or equal to the search key.

Follow that pointer and proceed in the same way with the keys at that node, until the search key is found or a leaf is reached that does not contain the search key.
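The search just described can be sketched in Python. This is a minimal in-memory sketch, assuming each node holds a sorted key list and a child list (empty for a leaf); the names `BTreeNode` and `btree_search` are illustrative, and the sample tree is a small order-5 example built by hand. In a disk-based B-tree each step down would cost one disk access.

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys stored in this node
        self.children = children or []  # empty for a leaf


def btree_search(node, key):
    """Return (node, index) where key is stored, or None if it is absent."""
    while node is not None:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1  # sequential scan of the keys in this node
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if not node.children:  # reached a leaf without finding the key
            return None
        node = node.children[i]  # follow the pointer between the keys
    return None


root = BTreeNode([26], [
    BTreeNode([6, 12], [
        BTreeNode([1, 2, 4]),
        BTreeNode([7, 8]),
        BTreeNode([13, 15, 18, 25]),
    ]),
    BTreeNode([42, 51, 62], [
        BTreeNode([27, 29]),
        BTreeNode([45, 46, 48]),
        BTreeNode([53, 55, 60]),
        BTreeNode([64, 70, 90]),
    ]),
])
hit = btree_search(root, 48)
print(hit[0].keys, hit[1])
print(btree_search(root, 30))
```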


A B‐Tree of Order 1001

B‐Tree Insertion Case 1: A key is placed in a leaf that still has some room

Shift keys to preserve ordering & insert new key.

Example (Insert 7): the root is (16, 22) with leaves (5, 9), (18, 19) and (33, 35, 39). After the insertion the first leaf is (5, 7, 9).

B‐Tree Insertion Case 2: A key is placed in a leaf that is full

Split the leaf, creating a new leaf, and move half the keys from full leaf to new leaf. Move median key to parent, and add pointer to new leaf in parent.

Example (Insert 8): the root is (16, 22) with leaves (2, 5, 7, 9), (18, 19) and (33, 35, 39). The full leaf is split and the median key 7 moves up. Result: root (7, 16, 22) with leaves (2, 5), (8, 9), (18, 19) and (33, 35, 39).

B‐Tree Insertion Case 3: The root is full and must be split

In this case, a new node must be created at each level, plus a new root. This split results in an increase in the height of the tree. This is the only case in which the height of the B-tree increases.

Example (Insert 15): the root is (7, 16, 22, 40) with leaves (2, 5), (8, 9, 12, 14), (18, 19), (33, 35, 39) and (43, 55, 59). The leaf (8, 9, 12, 14) is split and 12 moves up; the root then overflows and is split, and 16 moves up into a new root (move 12 & 16 up). Result: root (16) with children (7, 12) and (22, 40); the leaves are (2, 5), (8, 9), (14, 15), (18, 19), (33, 35, 39) and (43, 55, 59).

B+‐Tree

A B+-Tree has all keys, with attached records, at the leaf level. Search keys, without attached records, are duplicated at upper levels. A B+-tree also has links between the leaves.

[Figure: example of a B+-tree.]

Application: Web Search Engine

A web crawler program gathers information about web pages and stores it in a database for later retrieval by keyword by a search engine such as Google.

• Search Engine Task: Given a keyword, return the list of web pages containing the keyword.  

• Assumptions:

– The list of keywords can fit in internal memory, but the list of webpages (urls) for each keyword (potentially millions) cannot.

– A query could be for a single keyword or for multiple keywords, in which case the pages returned contain all of the keywords. Pages are not ranked.

What data structures should be used?

Summary

• AVL trees, 2-3 trees, 2-3-4 trees, red-black trees, and B-trees are all balanced trees, with O(log(N)) heights.

• 2-3-4 trees and red-black trees have a one-to-one correspondence.

• AVL trees and red-black trees are special binary search trees with simple data structure.

• B-trees with a large number of branches have small height and are suitable for storing large data sets on slow disks.

External Sorting

Problem: If a list is too large to fit in main memory, the time required to access a data value on a disk dominates any efficiency analysis.

1 disk access ≡ several million machine instructions

Solution: Develop external sorting algorithms that minimize disk accesses


Basic External Sorting Algorithm

• Assume unsorted data is on disk at start

• Let M = maximum number of records that can be stored & sorted in internal memory at one time

Algorithm

Repeat:

1. Read M records into main memory & sort internally.

2. Write this sorted sub‐list onto disk. (This is one “run”).

Until all data is processed into runs

Repeat:

1. Merge two runs into one sorted run twice as long 

2. Write this single run back onto disk

Until all runs processed into runs twice as long

Merge runs again as often as needed until only one large run:  the sorted list
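The algorithm can be sketched in Python with in-memory lists standing in for disk files. This is a minimal sketch under that assumption: a real implementation would stream runs from and to disk, and the function names and the sample keys are illustrative. `heapq.merge` from the standard library performs the merge of two sorted runs.

```python
from heapq import merge


def make_runs(data, m):
    """Phase 1: sort chunks of at most m records; each chunk is one run."""
    return [sorted(data[i:i + m]) for i in range(0, len(data), m)]


def merge_pass(runs):
    """Phase 2, one pass: merge neighbouring runs pairwise."""
    out = []
    for i in range(0, len(runs), 2):
        # A leftover single run at the end is copied through unchanged.
        out.append(list(merge(*runs[i:i + 2])))
    return out


def external_sort(data, m):
    runs = make_runs(data, m)
    passes = 0
    while len(runs) > 1:  # repeat until only one large run remains
        runs = merge_pass(runs)
        passes += 1
    return (runs[0] if runs else []), passes


data = [11, 96, 12, 35, 17, 99, 28, 58, 41, 75, 15, 94, 81]
result, passes = external_sort(data, 3)
print(len(make_runs(data, 3)), passes)
print(result)
```

With 13 records and M = 3, phase 1 produces five runs and three merge passes follow (runs of length up to 6, then 12, then the full list).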

Basic External Sorting

Unsorted Data on Disk: 11 96 12 35 17 99 28 58 41 75 15 94 81

Assume M = 3 (M would actually be much larger, of course.) First step is to read 3 data items at a time into main memory, sort them and write them back to disk as runs of length 3.

Next step is to merge the runs of length 3 into runs of length 6.

Next step is to merge the runs of length 6 into runs of length 12.

Next step is to merge the runs of length 12 into runs of length 24. Here we have less than 24, so we’re finished.

[Figures: the runs on disk after each step.]

2‐Way Sort: Requires 3 Buffers

• Pass 1: Read one page, sort it, write it.

– one buffer page used

• Pass 2, 3, …, etc.:

– three buffer pages used.

[Figure: main memory buffers INPUT 1, INPUT 2 and OUTPUT between the source disk and the destination disk.]

Two‐Way External Merge Sort

• Each pass we read + write each page in file.

• N pages in the file => the number of passes is $\lceil \log_2 N \rceil + 1$

• So total cost is: $2N(\lceil \log_2 N \rceil + 1)$

• Idea: Divide and conquer: sort subfiles and merge

Example, assuming one page has 2 records:

Input file: 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2

PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2

PASS 1 (2-page runs): 2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2

PASS 2 (4-page runs): 2,3,4,4,6,7,8,9 | 1,2,3,5,6

PASS 3 (8-page runs): 1,2,2,3,3,4,4,5,6,6,7,8,9

General External Merge Sort

More than 3 buffer pages. How can we utilize them?

• To sort a file with N pages using B buffer pages:

– Pass 0: use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each.

– Pass 1, pass 2, …, etc.: merge B‐1 runs.

[Figure: B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B-1 and OUTPUT) between the source disk and the destination disk.]

Cost of External Merge Sort

• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$

• Cost = 2N * (# of passes)

• E.g., with 5 buffer pages, to sort 108 page file:

– Pass 0: $\lceil 108/5 \rceil$ = 22 sorted runs of 5 pages each (last run is only 3 pages)

– Pass 1: $\lceil 22/4 \rceil$ = 6 sorted runs of 20 pages each (last run is only 8 pages)

– Pass 2: 2 sorted runs, 80 pages and 28 pages

– Pass 3: Sorted file of 108 pages

Number of Passes of External Sort

N               B=3   B=5   B=9   B=17   B=129   B=257
100             7     4     3     2      1       1
1,000           10    5     4     3      2       2
10,000          13    7     5     4      2       2
100,000         17    9     6     5      3       3
1,000,000       20    10    7     5      3       3
10,000,000      23    12    8     6      4       3
100,000,000     26    14    9     7      4       4
1,000,000,000   30    15    10    8      5       4
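The entries of this table follow from the pass-count formula $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$. The sketch below computes the same quantity with integer arithmetic only, to avoid floating-point rounding in the logarithm; the function name `num_passes` is illustrative.

```python
def num_passes(n_pages, b_buffers):
    """Passes needed to sort n_pages pages with b_buffers buffer pages."""
    runs = -(-n_pages // b_buffers)  # pass 0 leaves ceil(N / B) sorted runs
    passes = 1
    while runs > 1:
        runs = -(-runs // (b_buffers - 1))  # each later pass merges B-1 runs
        passes += 1
    return passes


for n in (100, 10_000, 1_000_000_000):
    print(n, [num_passes(n, b) for b in (3, 5, 9, 17, 129, 257)])
print(num_passes(108, 5), 2 * 108 * num_passes(108, 5))  # passes, page I/Os
```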