Page 1

91.102 - Computing II

Sorting.

How can we sort efficiently?

You are in charge of the 1890 census and you just took delivery of one of the new “card sorters” that Mr. Hollerith got you to buy...

If you have a “card sorter”, how do you sort?

WHAT is a card sorter?

A machine that takes a deck of cards and has 26+ bins (one for each letter of the alphabet + blank + digits) to stuff the cards in.

How does it do it?

Page 2

1) Put the deck in at one end. The last character of the “key” determines the bin the card ends up in. At the end, you have 26+ small decks, each with the same last symbol in the “key”.

2) Collect up the little decks, being careful not to mess the order, and pass the total deck through the sorter again, sorting on the “next-to-last” symbol.

3) Collect up the little decks, and repeat the process until you have sorted by the first symbol of the key.

4) Done…

Page 3

With n items and a key m characters long, you have just carried out m*n “comparisons”. If n gets larger, m will remain the same as long as you don’t change the length of the key.

Not bad…O(m*n) ~ O(n), since m is constant...
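To make the card-sorter procedure concrete, here is a minimal sketch of an LSD radix sort in C. Everything in it (the key length M, the character-string representation, the helper sorter_pass) is an assumption for illustration, not something from the slides: it makes one stable "bin" pass per key position, last position first.

#include <stdio.h>
#include <stdlib.h>

#define M    3      /* key length (assumed) */
#define BINS 256    /* one bin per possible character */

/* One pass of the "card sorter": distribute the cards by the character at
   position pos, then collect the bins back up in order (a stable pass). */
static void sorter_pass(const char *keys[], const char *tmp[], int n, int pos)
{
    int count[BINS + 1] = {0};
    int i, b;

    for (i = 0; i < n; i++)                       /* count cards per bin        */
        count[(unsigned char) keys[i][pos] + 1]++;
    for (b = 1; b <= BINS; b++)                   /* turn counts into bin starts */
        count[b] += count[b - 1];
    for (i = 0; i < n; i++) {                     /* distribute, keeping order   */
        b = (unsigned char) keys[i][pos];
        tmp[count[b]++] = keys[i];
    }
    for (i = 0; i < n; i++)                       /* collect the bins back up    */
        keys[i] = tmp[i];
}

void RadixSort(const char *keys[], int n)
{
    const char **tmp = malloc(n * sizeof *tmp);
    int pos;
    for (pos = M - 1; pos >= 0; pos--)            /* last symbol of the key first */
        sorter_pass(keys, tmp, n, pos);
    free(tmp);
}

int main(void)
{
    const char *deck[] = { "CAB", "ACE", "BAD", "ABE", "CAD" };
    int n = sizeof deck / sizeof deck[0], i;

    RadixSort(deck, n);
    for (i = 0; i < n; i++)
        printf("%s\n", deck[i]);
    return 0;
}

Each of the m passes touches every one of the n keys once, which is exactly the O(m*n) count above.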

Page 4

This method is called "Radix Sorting" and is quite efficient (O(n)) - as long as you know certain things about the data in advance: in this case, the maximum length of the key and the range of each key element.

If you don’t know much, what can you do? How efficient can you hope to be? O(n) for a set of n items? Worse?

Since you can't find the right place for something unless you have looked at it, you certainly won't expect anything better than O(n)...

Page 5

Sorting Themes

Comparison Based:
  Transposition Sorting: BubbleSort
  Insert and Keep Sorted: InsertionSort, TreeSort
  Priority Queue Sorting: SelectionSort, HeapSort
  Divide and Conquer: QuickSort, MergeSort
  Diminishing Increment Sorting: ShellSort

Address Calculation: ProxMapSort, RadixSort

Page 6

Lower Bound on the Cost of Sorting by Comparisons.

In how many ways can n items be presented?

There are n ways in which you can choose the first item, n - 1 ways for the second, n - 2 for the third, etc…

Total number of different ways:

n*(n - 1)*(n - 2)*…*3*2*1 = n!

There are n! ways in which a sorted set could be scrambled - so there are n! possible ways in which our set could arrive, waiting to be sorted.

Page 7

Comparisons are “binary” operations: something is or is not less than something else (we need an ordering test, not just “equal” or “not equal” - which is also a binary operation). If we think of a comparison as providing the decision to “go left” or “go right” in a binary tree, we can think of sorting as choosing the path from the root (the incoming set) to a leaf (the sorted set) in a binary tree.

How many levels must a binary tree with n! leaves have (that is, how many comparisons - at least - must we expect to carry out, at one comparison per level)? The answer is log2(n!) - remember that each level going UP the tree has half the number of nodes of the level below it.

Page 8

It is not too hard to prove that:

log2(n!) ~ n log2(n)

This can be done via a formula called Stirling’s Approximation, usually proven in a Calculus II course.
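A rough way to see the growth rate without invoking Stirling (a sketch, not part of the slides):

\[
\log_2(n!) \;=\; \sum_{k=1}^{n}\log_2 k \;\le\; n\log_2 n,
\qquad
\log_2(n!) \;\ge\; \sum_{k=\lceil n/2\rceil}^{n}\log_2 k \;\ge\; \frac{n}{2}\log_2\frac{n}{2},
\]

so \(\log_2(n!)\) is squeezed between \(\tfrac{n}{2}\log_2\tfrac{n}{2}\) and \(n\log_2 n\), both of order \(n\log_2 n\).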

What we have just shown is that “sorting by comparison” is, at best, an O(n log(n)) process… more precisely, an Ω(n log(n)) process. No O(n) for us down this path.

Unfortunately, we have NOT YET provided any O(n log(n)) algorithm to sort… Let’s go find us some...

Page 9

There are a number of methods we can develop. Let’s start with some based on “priority queues”.

We will examine, primarily, methods that work when the whole data set can be kept in memory and for which the amount of extra space needed is small. Since one “desirable” property of a sorted set is that of quick search, and the fastest reliable search method we know is “binary search”, we will expect our sorted set to be kept in an array (hashing, anyone?).

We will start with the set (sorted or unsorted - we just have no way of knowing) in an array, say from 1 to n.

5 6 3 2 7 1 9 8 4 0

Sort it in ascending order.

Page 10

5 6 3 2 7 1 9 8 4 0

We are going to divide this array into TWO parts: one will be the “priority queue” part, and the other will be the “sorted” part.

To begin with, the “priority queue” part must go from index 1 to index n, while the “sorted” part will be empty.

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 9 8 4 0
       (all ten positions form the Priority Queue; the Sorted part is empty)

Page 11

ExtractMax(PQ) = 9, or: Select the Largest (using, for example, a function from Selection Sort), swap it with the LAST entry and reduce the size of PQ by 1:

Original:
Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 9 8 4 0

New (the 9 swapped with the last entry; the Priority Queue is now positions 1 through 9):
Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 0 8 4 9

Page 12

The maximum extracted (9) now sits in the freed array position:

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 0 8 4 | 9
       (Priority Queue)    (Sorted)

ExtractMax = 8: Select the largest again and swap it with the last entry of the queue. Reduce the Queue by 1.

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 0 4 | 8 9
       (Priority Queue)  (Sorted)

Repeating the process until the priority queue is empty leaves us with a sorted array occupying the same space as the original one.

Page 13

Cost?

Each ExtractMax looks at all the remaining elements of the PQ: n on the first pass, n - 1 on the second, etc., plus it has to perform a swap and some housekeeping. The cost is then (n - 1) + … + 2 + 1 comparisons to find all the successive largest elements, and (n - 1) swaps. Adding them up, we get n(n - 1)/2 from the comparisons and n - 1 from the swaps - O(n^2) in total… This is not too good, since it means that doubling the number of items to be sorted quadruples the time it takes to sort them… 1,000 times as many items, 1,000,000 times as long… bummer…
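A minimal sketch of this selection-based scheme (plain int keys and 0-based indexing are assumptions for illustration, not the slides' code):

void SelectionSort(int A[], int n)
{
    int queue_size, i, max_pos, temp;

    for (queue_size = n; queue_size > 1; queue_size--) {
        /* ExtractMax by scanning the whole remaining queue */
        max_pos = 0;
        for (i = 1; i < queue_size; i++)
            if (A[i] > A[max_pos])
                max_pos = i;

        /* swap the maximum with the last entry of the queue; */
        /* it now sits in its final, "sorted" position        */
        temp = A[max_pos];
        A[max_pos] = A[queue_size - 1];
        A[queue_size - 1] = temp;
    }
}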

Can we do better? Well, there is another way of managing a Priority Queue - a Heap… that might be better.

Page 14

The requirement is that we first make a Heap out of the array - how cheap is that?

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 9 8 4 0

Think of this as being a complete binary tree. Start at position n/2, and look at positions n and n+1 (the children of n/2 in the heap). If the item at n/2 is smaller than either of its children, swap it with the larger of the two.

Since 7 > 0 and array position 11 doesn’t exist, leave everything as it is.

Page 15

The array is still:

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 9 8 4 0

Move up to n/2 - 1. This is position 4, with children at positions 8 and 9. Since 8 > 2 and 8 > 4, swap the 2 and the 8 to get:

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 8 7 1 9 2 4 0

Since index 8 has no children, stop. Move up one more position: 3, with children at 6 and 7. Since 9 > 3 (and 9 > 1), swap the 3 and the 9:

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 9 8 7 1 3 2 4 0

The 3, now at index 7, has no children, so stop.

Page 16

Now to position 2, with children at 4 and 5: 8 > 6, and 8 > 7. Swap 6 and 8:

Index:  1 2 3 4 5 6 7 8 9 10
Before: 5 6 9 8 7 1 3 2 4 0
After:  5 8 9 6 7 1 3 2 4 0

We have to check that the item swapped into position 4 satisfies the heap property with respect to its children at 8 and 9: since 6 > 2 and 6 > 4, everything is OK.

To position 1. 5 < 8, 5 < 9. Swap 5 and 9.

Page 17

Index: 1 2 3 4 5 6 7 8 9 10
Value: 9 8 5 6 7 1 3 2 4 0

Notice 5 is larger than either of its children - at 6 and 7. Stop. The next index on the left would be 0: we have a heap.

Page 18

What did it cost? We started from the middle, so we repeated the process only floor(n/2) times. Each time we compared an item with its immediate descendants, and then (maybe) with a next pair of descendants, and so on, until we had no more descendants. Since we are looking at a complete binary tree, the number of descendants of each item we look at will never be more than 2*log2(n) (two for the root, two for the chosen child, two for the chosen grandchild, up to a maximum of log2(n) generations of descendants), so the worst case for constructing a heap is O(n log2(n)) - one can actually show that it is better than that, but we don’t need to.
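A sketch of the construction just described, on a 1-indexed array A[1..n] of int keys (index 0 unused) to match the slides' indexing; the function names SiftDown and BuildHeap are assumptions:

void SiftDown(int A[], int n, int p)
{
    int child, temp;

    while (2 * p <= n) {                      /* while p has at least one child  */
        child = 2 * p;                        /* left child                      */
        if (child + 1 <= n && A[child + 1] > A[child])
            child++;                          /* pick the larger of the two      */
        if (A[p] >= A[child])
            break;                            /* heap property already holds     */
        temp = A[p]; A[p] = A[child]; A[child] = temp;
        p = child;                            /* follow the swapped item down    */
    }
}

void BuildHeap(int A[], int n)
{
    int p;
    for (p = n / 2; p >= 1; p--)              /* start at n/2 and move left      */
        SiftDown(A, n, p);
}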

Page 19

We know that ExtractMax will have cost O(log2(n)) - since all we need is to rebuild a heap. Let’s take a look:

Index: 1 2 3 4 5 6 7 8 9 10
Value: 9 8 5 6 7 1 3 2 4 0

ExtractMax = 9; the 0 in position 10 is moved to position 1 and “percolates” to its correct position:

Page 20

0 8 5 6 7 1 3 2 4 _        (0 moved to the root; position 10 is free)
8 0 5 6 7 1 3 2 4 _        (0 swapped with its larger child, 8)
8 7 5 6 0 1 3 2 4 _        (0 swapped with 7; Heap | Free)
8 7 5 6 0 1 3 2 4 9        (the extracted 9 placed at position 10; Heap | Sorted)

Repeat: ExtractMax = 8;

4 7 5 6 0 1 3 2 _ 9        (last heap item, 4, moved to the root; Heap | free | Sorted)
7 4 5 6 0 1 3 2 _ 9        (4 swapped with 7)
7 6 5 4 0 1 3 2 _ 9        (4 swapped with 6)

Index: 1 2 3 4 5 6 7 8 9 10

Page 21

Index: 1 2 3 4 5 6 7 8 9 10
Value: 7 6 5 4 0 1 3 2 | 8 9
       (Heap)            (Sorted)

ExtractMax = 7; etc… Continue until done…

What is the cost of this part? We must extract n elements, and reconstructing the heap each time costs

O(log2(size_of_heap)) ≤ O(log2(n)).
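A sketch of this extraction phase, reusing the SiftDown and BuildHeap sketch from a few pages back (an illustration, not the textbook's code; 1-indexed array A[1..n]):

void HeapSort(int A[], int n)
{
    int heap_size, temp;

    BuildHeap(A, n);                        /* O(n log2(n)) at worst          */
    for (heap_size = n; heap_size > 1; heap_size--) {
        temp = A[1];                        /* ExtractMax: the root ...       */
        A[1] = A[heap_size];                /* last heap item moves to root   */
        A[heap_size] = temp;                /* max goes into the sorted region */
        SiftDown(A, heap_size - 1, 1);      /* rebuild the heap: O(log2(n))   */
    }
}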

Page 22

Total cost:

O(n log2(n)) (to make the heap)

+ O(n log2(n)) (to sort with the Priority Queue method)

= O(n log2(n)).

Looks better, but more complicated (the law of no free lunch…).

Page 23

What other methods can we cook up? If we can split the array into two roughly equal parts, we might be able to sort both parts separately and “glue” the results together. A requirement would be that the “glue” be cheap: little or no extra work… Can we do it?

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 9 8 4 0

This (just splitting) won't quite do it - but maybe a variant will… after all, if you split in half enough times, you get subarrays with just one element, and ALL sets with just one element are SORTED. The question becomes: do we do the work before we split or after?

Page 24

Let's try doing the work before:

Move all the “small” elements into the “bottom half” and the large ones into the “top half” :

Index: 1 2 3 4 5 6 7 8 9 10
Value: 1 4 3 2 0 5 9 8 6 7

Notice that we have “just done it” - no algorithm given - without requiring that anything be sorted. Repeat the process with the left half and with the right half:

Page 25

1 0 3 2 4 5 6 8 9 7
0 1 2 3 4 5 6 7 9 8
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9

(Index: 1 2 3 4 5 6 7 8 9 10)

How many times did we divide the set? O(log2(n)), since we halved the size of the set on each pass. Unfortunately we did something else: move all the small items into the left half and all the large ones into the right half. This will cost O(n) on each pass...

Page 26

Total Cost O(n log2(n)) - not bad, EXCEPT: we have no algorithm to decide on how to move the small items left and the large ones right. What is “small”? We don’t know what is coming in, and we would like to avoid a pass through the array just to find out - and it may not be too useful anyway.

Small: pick an “arbitrary” element of the array: small is everything smaller than it, large is everything larger than it. Split the array into two parts based on this arbitrary choice. Repeat as needed, until you have arrays of 1 element - which are sorted by definition.

Page 27

Arbitrary: element in first position (any fixed, or random choice will do: the text chooses the MIDDLE element of the array).

There is NO BEST CHOICE… You may want to choose a random index position at each choice point: the problem is that you may use more computational power to look for a good pivot than you save by the choice...

Page 28

Index: 1 2 3 4 5 6 7 8 9 10
Value: 5 6 3 2 7 1 9 8 4 0

Pivot = 5 (the element in the first position)

Start incrementing indices from the left (i) as long as you have elements LESS than the pivot (you stop immediately here).

Start decrementing indices from the right (j) for as long as you have elements GREATER THAN the pivot (you stop immediately here too).

If i ≤ j swap:

Page 29

Index: 1 2 3 4 5 6 7 8 9 10
Value: 0 6 3 2 7 1 9 8 4 5      (i = 1, j = 10; Pivot = 5)

Increment i, decrement j, and go back to the top. Notice that i is 2 and j is 9: the inequalities are again right for a swap.

Index: 1 2 3 4 5 6 7 8 9 10
Value: 0 6 3 2 7 1 9 8 4 5      (i = 2, j = 9; Pivot = 5)

Page 30

Index: 1 2 3 4 5 6 7 8 9 10
Value: 0 4 3 2 7 1 9 8 6 5      (after the swap; i = 2, j = 9; Pivot = 5)

Value: 0 4 3 2 7 1 9 8 6 5      (scan again: i stops at 5 on the 7, j stops at 6 on the 1)

Value: 0 4 3 2 1 7 9 8 6 5      (after the swap: now j = 5 and i = 6 - they have crossed. Stop...)

Value: 0 4 3 2 1 7 9 8 6 5

Repeat on each of the subarrays...

Page 31

How about some code?

void QuickSort(SortingArray A, int m, int n)
{
   // sorts the subarray A[m:n] of array A into ascending order
   int i, j;

   if (m < n) {
      i = m;                    // initially i and j point to the
      j = n;                    // first and last items
      Partition(A, &i, &j);     // partitions A[m:n] into A[m:j] and A[i:n]
      QuickSort(A, m, j);
      QuickSort(A, i, n);
   }
}

Page 32

void Partition(SortingArray A, int *i, int *j)
{
   KeyType Pivot, Temp;

   Pivot = A[(*i + *j) / 2];              // middle key as pivot

   do {
      while (A[*i] < Pivot) (*i)++;       // find leftmost i such that A[*i] >= Pivot
      while (A[*j] > Pivot) (*j)--;       // find rightmost j such that A[*j] <= Pivot

      if (*i <= *j) {                     // if i and j didn't cross, swap
         Temp = A[*i];                    // A[*i] and A[*j]
         A[*i] = A[*j];
         A[*j] = Temp;
         (*i)++;                          // move i one space right
         (*j)--;                          // move j one space left
      }
   } while (*i <= *j);                    // while i and j have not crossed yet
}

Page 33

What is the down side? Sort:

0 1 2 3 4 5 6 7 8 9      (Pivot = 0, the first element - as in our walkthrough)
  1 2 3 4 5 6 7 8 9      (Pivot = 1)
    2 3 4 5 6 7 8 9      (Pivot = 2)
...

The process is repeated NOT log2(n) times but n times…

The size of the sets to be partitioned goes down by 1 at each level, and we have lost the O(n log2(n)) sort…

Page 34

This method, called QuickSort, does all its work at the time of partitioning the original set into two - you hope - “equal size” subsets. If the data come in skewed - so that the two sides of the partition are no longer “roughly equal” (I haven’t defined this, and it is NOT as bad as it sounds) - the “quick sorting” becomes O(n^2) and you might as well use other methods.

Page 35

Since the partitioning phase is where QuickSort is vulnerable, is there some other way of “dividing and conquering” where the “divide” is reliable?

Divide in half BEFORE you do any work - so there is no chance that you could mess up the division - and leave the work for later. The ideal algorithm of the procrastinator! (There IS value in the procrastinator’s approach: if you can afford to wait, many problems go away by themselves...)

Page 36

MergeSort. We will see that this algorithm is also good for data sets that are too large to fit in memory. It was originally devised when the only large scale memory available was the external TAPE...

5 6 3 2 7 1 9 8 4 0
Split in half:
5 6 3 2 7 | 1 9 8 4 0
Split in half:
5 6 | 3 2 7 | 1 9 | 8 4 0
Split in half:
5 | 6 | 3 | 2 7 | 1 | 9 | 8 | 4 0
Split in half:
5 | 6 | 3 | 2 | 7 | 1 | 9 | 8 | 4 | 0

Page 37

We are all split up. Now what?

Note that all sets of cardinality 1 are sorted - by definition. We started with an unsorted set and we are finishing with n (= 10) sorted ones. The idea: put them back together, two at a time, keeping the sorted property.

5 | 6 | 3 | 2 | 7 | 1 | 9 | 8 | 4 | 0      (ten sorted singletons)

Merging the smallest pieces first produces 2 7 and 0 4:
5 | 6 | 3 | 2 7 | 1 | 9 | 8 | 0 4

and then, one level up:
5 6 | 2 3 7 | 1 9 | 0 4 8

Let’s see exactly how the algorithm works: merge I = 5 6 with II = 2 3 7.

Page 38

I = 5 6 (j at the 5)      II = 2 3 7 (k at the 2)      III = empty (l at the first slot)

Start with two indices, referring to the beginning positions of the two sorted arrays (I[j], II[k]). Allocate an empty array III[l] of size equal to the sum of the sizes of I and II. Since II[k] < I[j], insert the contents of II[k] into III[l] and increment k and l.

I = 5 6 (j at the 5)      II = 2 3 7 (k at the 3)      III = 2

Since II[k] < I[j], repeat the copy of II[k] into III[l] and increment k and l.

Page 39

I = 5 6 (j at the 5)      II = 2 3 7 (k at the 7)      III = 2 3

Now I[j] < II[k], so copy I[j] into III[l] and increment j and l.

I = 5 6 (j at the 6)      II = 2 3 7 (k at the 7)      III = 2 3 5

Again I[j] < II[k], so copy I[j] into III[l] and increment j and l.

I = 5 6 (j past the end)      II = 2 3 7 (k at the 7)      III = 2 3 5 6

Now j is out of range for I, so just copy the rest of II into III...

Page 40

I = 5 6      II = 2 3 7      III = 2 3 5 6 7      Done!!!

Repeat for the other half (1 9 and 0 4 8):

0 1 4 8 9

And now repeat with the two original halves:

2 3 5 6 7      0 1 4 8 9

Page 41

void MergeSort(ListType List)
{
   if (the List has more than one item in it) {
      (break the List into two half-lists, L = LeftList and R = RightList)
      (sort the LeftList using MergeSort(L))
      (sort the RightList using MergeSort(R))
      (merge L and R into a single sorted List)
   } else {
      (do nothing, since the list is already sorted)
   }
}
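A concrete, array-based rendering of this pseudocode, using the merge scheme from the walkthrough. The function names, the inclusive bounds lo/hi and the temporary buffer III are assumptions for illustration, not the textbook's code:

#include <stdlib.h>

/* Merge combines A[lo..mid] and A[mid+1..hi], both already sorted. */
static void Merge(int A[], int lo, int mid, int hi)
{
    int size = hi - lo + 1;
    int *III = malloc(size * sizeof *III);     /* the "third array"         */
    int j = lo, k = mid + 1, l = 0;

    while (j <= mid && k <= hi)                /* copy the smaller head      */
        III[l++] = (A[j] <= A[k]) ? A[j++] : A[k++];
    while (j <= mid) III[l++] = A[j++];        /* one side ran out: copy the */
    while (k <= hi)  III[l++] = A[k++];        /* rest of the other side     */

    for (l = 0; l < size; l++)                 /* copy back over A[lo..hi]   */
        A[lo + l] = III[l];
    free(III);
}

void MergeSortRange(int A[], int lo, int hi)
{
    if (lo < hi) {
        int mid = (lo + hi) / 2;               /* split in half              */
        MergeSortRange(A, lo, mid);            /* sort the left half         */
        MergeSortRange(A, mid + 1, hi);        /* sort the right half        */
        Merge(A, lo, mid, hi);                 /* glue the two halves        */
    }
}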

Page 42

Cost?

In TIME: since “splitting in half” occurs O(log2(n)) times, we are going to have that many “levels of merging” before we are back up with the fully merged array. At each level, the merging will position one element correctly for every comparison - so O(n) comparisons (and swaps) per level. Total O(n log2(n)).

Since the splitting decision did NOT depend on chance, there is no way this scheme can go wrong… except…

Page 43

Cost?

In SPACE: the mergings seem to require a NEW array of size equal to the sum of the sizes of the arrays to be merged: drats! Twice as much space. And I thought I was going to get a free lunch (or at least a cheap one)!

If we used linked lists, we would not need as much space… but that would complicate our “splittings”, and the subsequent search: again, no free lunch.

Page 44

One possibility for space reduction observes that the two parts of the array are contiguous, copies the first half into a new array (only half the size) and frees enough space to carry out the procedure using the old array as the target.

We use only 50% more space; need more time to copy into the array before we merge; gain a little time (probabilistically), since the tail end - if nonempty - of the second part does not need to be copied at all.

Using the halves from before - I = 5 6 and II = 2 3 7 sitting side by side in the array - copy I out into a separate buffer (half the total size). The positions I used to occupy are now free and become the start of the merge target III, while II = 2 3 7 stays where it is. Merge the buffer (index j) and II (index k) back into the array from the left (index l)...
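A sketch of this half-space merge (the names buf, lo, mid, hi are assumptions), with j indexing the copied-out first half, k the second half, and l the target, as in the picture above:

#include <stdlib.h>

void MergeHalfSpace(int A[], int lo, int mid, int hi)
{
    int left_size = mid - lo + 1;
    int *buf = malloc(left_size * sizeof *buf);   /* only 50% extra space      */
    int j, k = mid + 1, l = lo;

    for (j = 0; j < left_size; j++)               /* copy the first half out   */
        buf[j] = A[lo + j];

    j = 0;
    while (j < left_size && k <= hi)              /* merge buf and A[k..hi]    */
        A[l++] = (buf[j] <= A[k]) ? buf[j++] : A[k++];   /* back into A        */
    while (j < left_size)
        A[l++] = buf[j++];                        /* leftover buffer items     */
    /* any leftover items of the second half are already in their final
       positions: the tail end of the second part is never copied at all */

    free(buf);
}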

Page 45

Another possibility for space reduction would use a linked-list based queue. Elements of the first subarray about to be overwritten are enqueued and then overwritten. The comparison between the heads of the first and second subarrays uses the queue to obtain the current head of the first subarray (unless the queue is empty).

In this case, it would be important to have the queue implemented so that calls to malloc and free are minimized (the queue maintains its own free list).

This method is likely (probabilistically) to use less space (about 1/2 for large data items - why?) than the previous one, and to require fewer copies to be made. Even more complex to code…

Page 46

Why is this method good for “external sorting”?

Take three tapes: A, B, C. The sets on A and B are sorted in increasing order, C is empty.

1) read the first items from A and from B.

2) Compare the item from A with the item from B.

3) Copy the smaller item into C starting at the beginning of empty space.

4) Read a new item from the tape that provided the smaller item.

5) if there is a new item go back to 2), otherwise

6) Copy all the unread items in the remaining tape to tape C.

7) Quit.
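A sketch of steps 1)-7) for two sorted "tapes" held as ordinary text files of one integer per line (the file format and the function name MergeFiles are assumptions; error handling is kept minimal):

#include <stdio.h>

void MergeFiles(const char *nameA, const char *nameB, const char *nameC)
{
    FILE *A = fopen(nameA, "r"), *B = fopen(nameB, "r"), *C = fopen(nameC, "w");
    int a, b, haveA, haveB;

    if (A == NULL || B == NULL || C == NULL) return;   /* minimal error handling */

    haveA = (fscanf(A, "%d", &a) == 1);        /* 1) read the first items        */
    haveB = (fscanf(B, "%d", &b) == 1);

    while (haveA && haveB) {                   /* 2) compare the two items       */
        if (a <= b) {
            fprintf(C, "%d\n", a);             /* 3) copy the smaller one to C   */
            haveA = (fscanf(A, "%d", &a) == 1);/* 4) refill from the same tape   */
        } else {
            fprintf(C, "%d\n", b);
            haveB = (fscanf(B, "%d", &b) == 1);
        }
    }                                          /* 5) loop until one tape is done */
    while (haveA) {                            /* 6) copy the rest of A          */
        fprintf(C, "%d\n", a);
        haveA = (fscanf(A, "%d", &a) == 1);
    }
    while (haveB) {                            /* 6) ... or the rest of B        */
        fprintf(C, "%d\n", b);
        haveB = (fscanf(B, "%d", &b) == 1);
    }
    fclose(A); fclose(B); fclose(C);           /* 7) quit                        */
}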

Page 47

If the two tapes A and B are unsorted, you need two more tapes, C and D.

a) Take the first element of A and the first element of B and write them, in sorted order, onto C; take the next element of each and write them, sorted, onto D;

b) Repeat a) until A and B have been read and copied. If one of the tapes ends before the other, just take two successive elements from the one still containing data, sort them, and keep alternating the output until done. Now C and D have sorted PAIRS.

Page 48

c) C -> A; D -> B; swap the tapes.

d) Repeat the process, with the first two elements of A and the first two of B sorted into C; the next two pairs get sorted into D;

e) repeat with sets of cardinality 4, 8, 16, 32, etc… until done.

Page 49

Notice that in External MergeSort we never needed to hold more than two items in memory at the same time.

Why don’t we use this method all the time?

Tapes are REALLY, REALLY slow…

How about disk?

Same problem: memory-to-memory copy takes on the order of nanoseconds, while disk-to-disk copy takes 10 milliseconds or more… a factor of 10^5 or worse.

Page 50

How about Linked Lists? After all, all you need to do there is move a pointer… no extra space.

Great! But fast searching still requires an array (or tree), so we need to convert from sorted linked lists to arrays (or trees): where do we get the space from?

You could copy the sorted list out to disk, and bring it back in as an array…

There is no point in bringing it in as a tree, since this would require building a balanced binary search tree from a sorted list… you might as well have built the balanced binary search tree to begin with - it would have been cheaper (why?).

Page 51

Insertion Sort (into an array).

Value: 5 6 3 2 7 1 9 8 4 0      (all of it as yet unsorted)

Remove the first item in the unsorted portion of the array and insert it into the sorted portion so that it will remain sorted. Notice the removal freed a position at the end of the sorted portion:

Value: _ 6 3 2 7 1 9 8 4 0      (Sorted: empty, plus the freed slot | as yet unsorted)

Since the initially sorted portion was empty, inserting this first element into the freed position will provide a sorted array of length 1:

Page 52

Value: 5 _ 3 2 7 1 9 8 4 0      (Sorted: 5 | the 6 has been removed | as yet unsorted)

Remove the next element from the unsorted subarray. Insert it - comparing from the beginning - into the sorted subarray, making sure that it remains sorted:

Value: 5 6 3 2 7 1 9 8 4 0      (Sorted: 5 6 | as yet unsorted)

Repeat: Remove, Compare and Insert until done.

Value: 5 6 _ 2 7 1 9 8 4 0      (the 3 has been removed)
Value: 3 5 6 2 7 1 9 8 4 0      (Sorted: 3 5 6 | as yet unsorted)

Page 53

Cost:

With no elements in the sorted part: 0 comparisons, 0 moves, 1 insertion.

With 1 element: 1 comparison and either 1 move and 1 insertion or 1 insertion (the new item goes AFTER the existing one).

With k elements in the sorted part: up to k comparisons and moves, plus one insertion.

To manage all n elements, the worst case is therefore about

1 + 2 + … + n = n(n + 1)/2 ~ O(n^2) operations.

It’s easy to code, though….
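A sketch of the idea in C for a 0-indexed int array (note: this version scans the sorted part from the back rather than "comparing from the beginning" as on the slide - the cost bound is the same):

void InsertionSort(int A[], int n)
{
    int i, k, key;

    for (i = 1; i < n; i++) {            /* A[0..i-1] is already sorted       */
        key = A[i];                      /* remove the next unsorted item     */
        k = i;
        while (k > 0 && A[k - 1] > key) {
            A[k] = A[k - 1];             /* slide larger keys rightward       */
            k--;
        }
        A[k] = key;                      /* drop the item into the open hole  */
    }
}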

Page 54

TreeSort: Insert the incoming data into a binary search tree. The tree could rely on randomness of input or be one of those guaranteed to remain balanced (AVL, Red-Black, etc.). In either case, each insertion takes time (probabilistically or deterministically) proportional to log2(k), where k is the number of items already in the tree.

log2(1) + log2(2) + … + log2(k) + … + log2(n) ~ n log2(n).

Downside: the probabilistic search tree can become a list, with each insertion costing time proportional to n - total O(n^2). The AVL trees are fairly complex to code up correctly and have more overhead...
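A sketch of TreeSort with a plain, unbalanced binary search tree (all names are illustrative; a balanced AVL or Red-Black tree would replace Insert to guarantee the bound):

#include <stdlib.h>

struct Node { int key; struct Node *left, *right; };

static struct Node *Insert(struct Node *t, int key)
{
    if (t == NULL) {
        t = malloc(sizeof *t);
        t->key = key; t->left = t->right = NULL;
    } else if (key < t->key) {
        t->left = Insert(t->left, key);
    } else {
        t->right = Insert(t->right, key);   /* duplicates go to the right */
    }
    return t;
}

static int InOrder(struct Node *t, int A[], int out)
{
    if (t != NULL) {
        out = InOrder(t->left, A, out);
        A[out++] = t->key;                  /* visit keys in ascending order */
        out = InOrder(t->right, A, out);
    }
    return out;
}

void TreeSort(int A[], int n)
{
    struct Node *root = NULL;
    int i;
    for (i = 0; i < n; i++)
        root = Insert(root, A[i]);
    InOrder(root, A, 0);                    /* tree nodes are leaked in this sketch */
}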

Page 55

ShellSort:

The idea: things that are far apart will probably remain apart (this may or may not be true).

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Value: 5 11 3 12 7 1 10 8 4 0 2 9 13 14 6

So: first move things that are far apart, and then move things that are closer together.

Start with an “increment” that is fairly large, say 5.

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Value: 5 11 3 12 7 1 10 8 4 0 2 9 13 14 6

Positions 0, 5, 10 form one subarray, positions 1, 6, 11 the next, and so on; each such subarray will be sorted by means of an InsertionSort algorithm.

Page 56

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Value: 5 11 3 12 7 1 10 8 4 0 2 9 13 14 6

Sort positions 0, 5, 10; then 1, 6, 11; etc.:

1 11 3 12 7 2 10 8 4 0 5 9 13 14 6      (positions 0, 5, 10 sorted)
1 9 3 12 7 2 10 8 4 0 5 11 13 14 6      (positions 1, 6, 11 sorted)
1 9 3 12 7 2 10 8 4 0 5 11 13 14 6      (positions 2, 7, 12: already in order)
1 9 3 4 7 2 10 8 12 0 5 11 13 14 6      (positions 3, 8, 13 sorted)
1 9 3 4 0 2 10 8 12 6 5 11 13 14 7      (positions 4, 9, 14 sorted)

Change the “increment” to something smaller, say 3: choosing a sequence of increments that are mutually relatively prime should avoid repetitions of work - hopefully.

Page 57

1 9 3 4 0 2 10 8 12 6 5 11 13 14 7      (start of the increment-3 passes)
1 9 3 4 0 2 6 8 12 10 5 11 13 14 7      (positions 0, 3, 6, 9, 12 sorted)
1 0 3 4 5 2 6 8 12 10 9 11 13 14 7      (positions 1, 4, 7, 10, 13 sorted)
1 0 2 4 5 3 6 8 7 10 9 11 13 14 12      (positions 2, 5, 8, 11, 14 sorted)

Now increment by 1:

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Value: 1 0 2 4 5 3 6 8 7 10 9 11 13 14 12

Since the elements are now nearly sorted, it makes sense to perform the insertion “from the back”: check if the new element, from the unsorted part, is LARGER than the last element of the sorted part - if true, ONE comparison and no swaps are enough...

Page 58

void ShellSort(SortingArray A)
{
   int i, delta;

   delta = n;                      // pick a value - length of the array?
   do {
      delta = 1 + delta / 3;
      for (i = 0; i < delta; ++i) {
         DeltaInsertionSort(A, i, delta);
      }
   } while (delta > 1);
}

Page 59

void DeltaInsertionSort(SortingArray A, int i, int delta)
{
   int j, k;
   KeyType KeyToInsert;
   bool NotDone;

   j = i + delta;
   while (j < n) {
      // obtain a new KeyToInsert
      KeyToInsert = A[j];

      // move each Key > KeyToInsert rightward by delta spaces
      // to open up a hole in which to place the KeyToInsert
      k = j;
      NotDone = true;
      do {
         if (A[k - delta] <= KeyToInsert) {
            NotDone = false;
         } else {
            A[k] = A[k - delta];
            k -= delta;
            if (k == i) NotDone = false;
         }
      } while (NotDone);

Page 60

      // Continue...
      // put KeyToInsert in the hole A[k] opened by moving
      // keys > KeyToInsert rightward
      A[k] = KeyToInsert;

      // consider next KeyToInsert at an increment of delta
      // to the right
      j += delta;
   }
}

Page 61

Analysis (i.e., cost): after more than 40 years there is still no complete theoretical study of this sorting method. Lots of people have tried; nobody has succeeded except in some special cases. Empirical studies indicate a time complexity of about O(n^1.25).

There are other sorts - one of the early ones is known as BubbleSort: the elements just “bubble up” to their proper position. Easy to code, bad for efficiency.

Page 62

Some time comparisons.

Array Size | QuickSort    | HeapSort     | ShellSort | BubbleSort | InsertionSort | SelectionSort | MergeSort
Bound      | O(n log2(n)) | O(n log2(n)) | O(n^1.25) | O(n^2)     | O(n^2)        | O(n^2)        | O(n log2(n))
        64 | 0.40         | 0.61         | 0.42      | 2.76       | 1.12          | 1.40          | 0.99
       128 | 0.98         | 1.43         | 1.04      | 11.36      | 4.47          | 5.56          | 2.28
       256 | 2.22         | 3.28         | 2.37      | 46.42      | 17.58         | 22.18         | 5.13
       512 | 4.94         | 7.43         | 5.44      | 189.35     | 69.89         | 88.66         | 11.45
      1024 | 10.86        | 16.57        | 11.97     | 766.22     | 280.27        | 354.48        | 25.11

Some other considerations:

size of the data set vs. ease of coding;

What are you comparing? Integers, floats, strings, playing cards?

Page 63

The Radix Sort we introduced at the beginning is O(n) - deterministically so. If you have enough information about the incoming data, this may be preferable to any other method…

If you are confident you can always choose a decent pivot, use QuickSort.

If you are convinced that a malevolent demon will always send you the worst possible data set for your sorting scheme, and the data set is large, then use HeapSort or MergeSort (if you have lots of extra memory - or too little, and so have to go 'external').

If you have a small data set, use Insertion Sort.

For a medium size data set, reasonably scrambled, you could use ShellSort.