Sorting Algorithms Analysis 1
An Analysis of the Pragmatic Differences in How the Insertion Sort, Merge Sort, and Quicksort Algorithms Comparison Sort Numbers
Stored in the Common Java-language Array Data Structure
Gina Soileau, Muhammad Younus, Suresh Nandlall,
Tamiko Jenkins, Thierry Ngouolali, Tom Rivers
Data Structures and Algorithms (SMT-274304-01-08FA1)
Professor James Iannibelli
December 21, 2008
Introduction
Data structures and the algorithms that utilize them are foundational to the study of
computer programming. The ability to determine what particular combination of both will
achieve maximum efficiency is an essential skill that one must master in order to be an effective
programmer. This course was designed to expose students to a wide variety of data structures
and algorithms, as well as the situations in which their proper implementation can be leveraged
to yield the desired results. Towards that end, our team has been tasked with developing a final
project that analyzes the practical application of three different sorting algorithms on a specific
data structure.
This paper is the culmination of that group effort and seeks to explore the topic in detail
as viewed through the lens of our team project. It discusses the team project proposal and the
rationale behind the team’s choices with respect to the project’s targeted data structure and
algorithms in order to provide a comprehensive analysis of the performance metrics obtained.
The Project Plan
The team project involved implementing and testing the efficiency of three sorting
algorithms using a particular data structure. As we have learned in this course, every sorting
algorithm has its own strengths and weaknesses. With that in mind, we planned to deliberately
vary not only the size of the data to be sorted but also the manner in which it was arranged, to
ensure we would be as thorough as possible. This meant we needed to test random
arrangements of data as well as pre-sorted data, both in ascending and descending sequence.
Through this series of tests, we intended to ascertain which sorting algorithms worked
best with different data sets. Data sets of varying size and order would be created and the chosen
sorting algorithms would be performed on each. To assess the performance of each algorithm,
we would measure the sort times for the varying data sets and compare those results against the
known efficiencies of the three sorts. With the basic plan in place, we then turned our attention
to the selection of three sorting algorithms and a data structure.
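The measurement approach described above can be sketched in Java. The class and method names here (TimingHarness, timeSort) are our own illustrative choices, not the project's actual code; the key points are copying the array before each run so the original data set is reusable, and timing with a high-resolution clock.

```java
import java.util.Arrays;
import java.util.Random;

// A minimal sketch of the timing approach described above (illustrative, not
// the team's actual project code).
public class TimingHarness {

    // Times a single run of a sort on a copy of the data, returning elapsed
    // milliseconds. Copying keeps the original arrangement intact for reuse.
    static double timeSort(java.util.function.Consumer<int[]> sort, int[] data) {
        int[] copy = Arrays.copyOf(data, data.length);
        long start = System.nanoTime();
        sort.accept(copy);
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // A reproducible random data set of 1000 elements.
        int[] random = new Random(42).ints(1000, 0, 10_000).toArray();
        double ms = timeSort(Arrays::sort, random);
        System.out.printf("Sorted 1000 elements in %.2f ms%n", ms);
    }
}
```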
Among the many sorting algorithms presented in this course, we chose to use Insertion
Sort, Quicksort, and Merge Sort. Insertion Sort was our first choice because it is straightforward
and easy to both understand and implement. As Drake (2006) notes, this is the kind of algorithm
many people use when sorting a hand of cards. Since our task was to compare and contrast
different sorting algorithms, it made sense to include it as a baseline because it provided a way
for us to measure the difference in efficiency between a simple sorting algorithm and one that is
more advanced. According to Lafore (2003), Insertion Sort has both an average and worst-case
running time of O(n²).
Our next choice was Quicksort. Lafore (2003) states that Quicksort is the most popular
sorting algorithm because, in the majority of situations, it is the fastest: operating in O(n log n)
time. However, as Drake (2006) points out, Quicksort can have a worst-case running time that is
quadratic: O(n²). This made it an ideal candidate for our project not only because it is so
popular, but also because its performance can degrade from the best average running time to
no better than the average-case performance of Insertion Sort.
Finally, we settled on Merge Sort as our final sorting algorithm because it has the
distinction of always running in O(n log n) time (Lafore, 2003, p. 725). This sets Merge Sort
apart from Quicksort because it is just as fast in both its average and worst-case scenarios
whereas Quicksort is not. With Insertion Sort as our baseline and Quicksort taking up the middle
position on the performance scale, Merge Sort was a great candidate for rounding out our
coverage of various sorting algorithms.
Having chosen the data to sort and the algorithms to sort them, the last step was
determining the data structure. Throughout the course, we were exposed to many different data
structures; however, one particular type of data structure stood out from the rest: arrays.
Primarily, we chose arrays because of their ubiquitous nature. This commonly used data
structure, populated with a variety of different data sets and sorted by three sorting algorithms
that span the running time performance scale, completed the perfect mix of elements for our
experiments.
Insertion Sort
The Insertion Sort algorithm has a very simple approach. Each value is examined in turn
and placed in the proper sequence with respect to all of the values that precede it. The
algorithm makes n – 1 passes through a sequence of n elements. On each
pass, it inserts the next element into the subarray on its left, thereby leaving that subarray sorted.
When the last element is inserted, the entire array is sorted. For example, consider an array of
integers as depicted in the top part of the diagram below:
Image 1. Insertion Sort.
(Drake, 2006, p. 212)
The process begins by selecting the second element in the array as the target value. This
is represented by the value of 5 in the top line of the diagram. Next, each value to the left of the
target value is examined in turn, continuing leftward until the beginning of the array is reached.
If the value being currently examined is less than the target value, then it is allowed to remain
where it is. This is shown by the second line in the diagram where the target value of 5 is less
than the examined value of 2.
If the value being currently examined is not less than the target value, then it is shifted to
the right one position and the next value to the left is checked. Looking at the progression from
the second line of the diagram to the third, we see that the target value of 1 is first compared to
the value to its immediate left. Since the value of 5 is greater than our target value of 1, the 5 is
shifted one place to the right and ends up occupying the position of our target value. This is not
a problem, however, because we know that our target value should not be there anyway. We
then look at the next value to the left, which is 2. Again, our target value is less than that so we
shift the 2 over one position to the right to occupy the position recently vacated by the 5. After
the 2, there are no more values to examine so our target value is placed in the first position in the
array as shown in the third line of the diagram. This is because it has to be put somewhere and
that is the only place left available – conveniently for us, the 2 has already been shifted to the
right leaving us an open spot.
Once the last value in the array is examined and all shifts are completed, the next value in
the array to the right of the previous target value becomes the new target value and the process
begins again. Once there are no more target values to process, the array is sorted as shown in the
last line of the diagram.
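The walkthrough above translates directly into code. The following is a straightforward Java rendering of the standard Insertion Sort (a sketch of the algorithm itself, not necessarily the team's exact implementation):

```java
import java.util.Arrays;

// Insertion Sort as described above: each target value is inserted into the
// already-sorted subarray to its left.
public class InsertionSortDemo {
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {   // n - 1 passes
            int target = a[i];                 // value being inserted
            int j = i - 1;
            // Shift values larger than the target one position right,
            // opening a slot for the target.
            while (j >= 0 && a[j] > target) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = target;                 // drop the target into the slot
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 1, 8, 3};
        insertionSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 5, 8]
    }
}
```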
Quicksort
Quicksort was created by C.A.R. Hoare in 1962 (Lafore, 2003, p. 333). Compared to the
Insertion Sort, Quicksort definitely takes sorting to a new level. It is a divide and conquer
algorithm that divides the data to be sorted into pieces, recursively sorts the pieces, and then
recombines the sorted pieces in order to produce a sorted array (Drake, 2006, p. 249). The basic
concept that powers this algorithm is outlined in the diagram below:
Image 2. Quicksort
(Drake, 2006, p. 240)
The first line represents the unsorted array of values. As the comment below it indicates,
the array is partitioned into two separate groups based on whether they are “small” or “large”.
This determination is based on the selection of one of the values in the array as a pivot. Once
selected, this pivot value serves as the basis for determining which of the other values are smaller
or larger. In the diagram above, the value of the last element of the original array has been
selected as the pivot and is shown as the shaded element in the second line. After moving all
values smaller than the pivot to the left and all values larger to the right, recursion is employed to
sort each of the sub-arrays, left and right of the pivot, using the same methodology. The result is
a sorted array as seen in the last line of the diagram.
To better understand exactly how this algorithm works, let us walk through the
first pass of a Quicksort on an array of values using the diagram below:
Image 3. Quicksort pass
(Drake, 2006, p.241)
The first line indicates that the last element in the array has been chosen as the pivot.
Next, a pointer is constructed to keep track of what part of the array we have already sorted, and
is pointed at the first value we need to examine. In our example, this is the value of 3 indicated
in white on the second line. Now we begin testing values against the pivot value starting from
the left. Since 3 is less than our pivot value of 4, it must be swapped with the value pointed to by
our pointer. In this case, both the position of the value being examined and the position of our
pointer are the same, so the result of swapping them really does not do anything. We then
increment our pointer to the unsorted part of the array over one position to the right so that it
now points to the 8, focus our attention at the next value to examine (also the 8), and compare it
to the pivot. Since 8 is greater than 4, we don’t swap any values or increment our pointer, either
– we simply move one position to the right and compare this new value, the 6, to our pivot.
Once again, this value is larger than the pivot so we compare the next value, the 1, to the pivot.
Now we have a value that is smaller than the pivot so we need to do a swap. Remember that the
pointer is still pointing at the 8, so we will be swapping the 1 and the 8 as indicated by the
progression from line 2 to line 3 in our diagram. Of course, we must also increment our pointer
to point at the 6.
As we move from line 3 to line 4 of the diagram, we see that we have to move all the way
right to the 2 before we need to make another swap because all of the values in-between are
larger than the pivot. At that stage, our pointer is still pointing at the 6 from our last swap, so the
2 is swapped with the 6 and the pointer is moved to the 8 as shown on line 4. After inspecting
the remaining numbers, there are none left to examine that are larger than the pivot so no more
swaps are required. The final step, however, requires us to move our pivot value to the last
position indicated by our pointer like the last two lines of our diagram show.
Of course, the array isn’t sorted after this pass, but that’s where the recursive part of the
algorithm comes into play. Our pointer is still pointing at where the 4 is on the last line of the
diagram used in this example so we know that all the values to its left are smaller and all the
values to its right are larger. We can now call the same routine two more times, once with the
array of values to the left of our pointer and once with the values to the right, and process those
sub-arrays just like we did with the entire array in our first pass. As these sub-arrays are
manipulated, they become smaller and smaller, until there is nothing left to process. When that
occurs, our array is finally sorted.
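The partition-and-recurse process just described can be sketched in Java. This version pivots on the last element, matching the diagrams above; note that the team's project code pivots on the first element instead, which is what causes the degraded behavior discussed later.

```java
import java.util.Arrays;

// Quicksort as walked through above: partition around a last-element pivot,
// then recursively sort the two sub-arrays.
public class QuicksortDemo {
    static void quicksort(int[] a, int lo, int hi) {
        if (lo >= hi) return;                  // 0 or 1 elements: already sorted
        int pivot = a[hi];
        int boundary = lo;                     // first slot of the "large" region
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {                // small value: swap into left region
                int tmp = a[i]; a[i] = a[boundary]; a[boundary] = tmp;
                boundary++;                    // advance the pointer
            }
        }
        // Final step: move the pivot into the boundary position.
        int tmp = a[hi]; a[hi] = a[boundary]; a[boundary] = tmp;
        quicksort(a, lo, boundary - 1);        // values smaller than the pivot
        quicksort(a, boundary + 1, hi);        // values larger than the pivot
    }

    public static void main(String[] args) {
        int[] a = {3, 8, 6, 1, 7, 2, 4};       // the array from the walkthrough
        quicksort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 6, 7, 8]
    }
}
```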
Merge Sort
Merge Sort was invented in 1945 by John von Neumann (Wikipedia, 2008). Like
Quicksort, Merge Sort is a divide and conquer algorithm. The diagram below gives an overview
of how it works:
Image 4. Mergesort
(Drake, 2006, p. 235)
Starting with an array of values, the algorithm begins by dividing it in half as shown in
the first two lines of the diagram. Recursive calls are made to continue dividing each piece in
half until the end result is an array of just a single value. Since such an array is considered
sorted, the next step is to merge all of the sorted pieces together into the final sorted array.
For example, let’s follow the left half of the example array on line 2 of the diagram as it
is processed. Since it contains more than one element, it is further divided into two arrays, the 3
and the 8 in the left half and the 6 and 1 in the right half. Again, we are dealing with arrays that
contain more than one value so we undergo the division process once more. This time, we end
up with the 3 and the 8 being divided in half into an array containing just the 3 on the left and an
array with just the 8 on the right. The same thing occurs with the array containing the 6 and 1.
The 6 ends up in the left array by itself and the 1 is placed in its own array on the right. This
leaves us with four single-value arrays that are considered sorted. Now the only thing left to do
is merge them together in the proper sequence.
The diagram below illustrates how separate sorted arrays are merged together:
Image 5. Mergesort merge
(Drake, 2006, p. 237)
The merging process requires another array that is the same size as the original array of
values to hold the final sorted result. The first step is to compare the first value in each of the
two arrays being merged. On line 1 of the diagram, the 1 is smaller so it is placed in the first
position of the result array. The shaded elements of arrays a and b represent values that have
already been merged and help to show which two values will be compared next. Moving to line
2, the 2 from array b is smaller than the next value in array a, the 3, so the 2 is moved to the next
open position in the result array. This continues until all of the elements in each array are
exhausted. Once that occurs, the result array contains the sorted values.
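The divide-and-merge process described above can be written compactly in Java. This is a sketch of the standard algorithm, not the project's exact code; each merge allocates a result array, mirroring the extra storage the text mentions.

```java
import java.util.Arrays;

// Merge Sort as described above: divide until single-value arrays remain,
// then merge sorted pieces back together.
public class MergeSortDemo {
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;                       // single value: sorted
        int mid = a.length / 2;
        int[] left  = mergeSort(Arrays.copyOfRange(a, 0, mid));
        int[] right = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);
    }

    // Merge two sorted arrays by repeatedly taking the smaller head value.
    static int[] merge(int[] x, int[] y) {
        int[] result = new int[x.length + y.length];
        int i = 0, j = 0, k = 0;
        while (i < x.length && j < y.length)
            result[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
        while (i < x.length) result[k++] = x[i++];         // drain leftovers
        while (j < y.length) result[k++] = y[j++];
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mergeSort(new int[]{3, 8, 6, 1, 7, 2, 5, 4})));
        // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```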
Test Results
After constructing the arrays to sort and the code to sort them using our three sorting
algorithms, we ran the project code and gathered some test results. The sections that follow
cover how the algorithms performed with each of the three arrangements of data within our
target test arrays: pre-sorted in ascending order, pre-sorted in descending order, and random
order. Results from our initial runs as well as additional test runs using larger arrays and more
random sequences are included.
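The three arrangements described above are simple to construct. The helper names below (TestData, ascending, descending) are illustrative, not the project's actual code; seeding the random generator keeps the random runs reproducible.

```java
import java.util.Random;

// Illustrative helpers for building the three test arrangements: pre-sorted
// ascending, pre-sorted descending, and random.
public class TestData {
    static int[] ascending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;          // 0, 1, 2, ...
        return a;
    }

    static int[] descending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = n - 1 - i;  // n-1, n-2, ...
        return a;
    }

    static int[] random(int n, long seed) {
        return new Random(seed).ints(n).toArray();     // reproducible random data
    }
}
```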
Pre-Sorted Arrays in Ascending Sequence
This test was performed with two arrays containing 1000 and 6000 elements respectively.
The results from both runs are shown below:
Table 1. Sorting Algorithms’ Performance on an Array of 1000 Pre-sorted Numbers in Ascending Order

Algorithm         Time (in ms)
Merge Sort        2.87
Quicksort         2.69
Insertion Sort    0.21
Graph 1. Sorting Algorithms’ Performance on an Array of
1000 Pre-sorted Numbers in Ascending Order
Table 2. Sorting Algorithms’ Performance on an Array of 6000 Pre-sorted Numbers in Ascending Order

Algorithm         Time (in ms)
Merge Sort        7.14
Quicksort         74.49
Insertion Sort    2.87
Graph 2. Sorting Algorithms’ Performance on an Array of
6000 Pre-sorted Numbers in Ascending Order
Notice that when the data set is small, Quicksort and Merge Sort take approximately the
same amount of time to complete. However, when the data set is larger, the difference between
them is much more pronounced. On modern computers, small inputs finish so quickly that fixed
overhead dominates the measurements; only when we force the processor to do substantial work
do we begin to get meaningful results.
The reason why Insertion Sort is the clear winner with this arrangement of data is
that it only needs to traverse the array; no other actions need to be performed. More
interesting, however, is the reason why Quicksort did so much worse than Merge Sort. The
answer lies in the choice of the Quicksort pivot value. In our project, the code for Quicksort
always uses the first element in the array being sorted as the pivot value. An array that is
already pre-sorted in ascending sequence will always have considerably more values larger than
the first value; if there are no duplicate values in the array, then all other values will be larger
than the pivot. This means that the inherent efficiency of breaking the original array into
roughly equal halves around a near-median pivot never materializes. Although Quicksort
performs more comparisons than Mergesort (approximately 1.39 N log N in the average case),
the low overhead of its simple comparison operations normally produces a faster result for a
randomized list without duplicates (Sedgewick & Wayne, 2007, p. 7). Mergesort requires
2N + O(log N) memory while Insertion Sort requires only N + O(1) (Sedgewick & Wayne, 2007,
p. 2), and Insertion Sort’s small overhead makes it faster than both on very small (sub)arrays.
Quicksort’s running time is highly sensitive to how far the pivot value lies from the median of
all the values; a pivot near the median, which a random data set tends to supply, yields a fast
sort. Merge Sort does not use a pivot value, so it is unaffected by the initial ordering.
Pre-Sorted Arrays in Descending Sequence
The next test we performed used arrays that were pre-sorted in descending order. The
results from both test runs are shown below:
Table 3. Sorting Algorithms’ Performance on an Array of 1000 Pre-sorted Numbers in Descending Order

Algorithm         Time (in ms)
Merge Sort        0.37
Quicksort         1.88
Insertion Sort    4.67
Graph 3. Sorting Algorithms’ Performance on an Array of 1000 Pre-sorted Numbers in Descending Order
Table 4. Sorting Algorithms’ Performance on an Array of 6000 Pre-sorted Numbers in Descending Order

Algorithm         Time (in ms)
Merge Sort        9.54
Quicksort         76.78
Insertion Sort    106.86
Graph 4. Sorting Algorithms’ Performance on an Array of 6000 Pre-sorted Numbers in Descending Order
This time, the tables were turned on the Insertion Sort and it did very poorly. Unlike with
the pre-sorted array in ascending order where no work had to be performed, the Insertion Sort
was forced to do the maximum amount of work possible – every element needed to be shifted on
each pass. With respect to Quicksort, it encountered the same partitioning problem it had with
the previous array. The only difference with this run was that now all the other values were
smaller than the pivot value instead of larger. As in the previous test, Merge Sort did its job in
about the same amount of time.
Multiple Random Order Arrays
Rounding out our testing regimen was the task of sorting a number of arrays containing
values arranged in random order. The results from both runs are shown below:
Table 5. Sorting Algorithms’ Average Performance on Arrays of 1000 Random Numbers over 100 Runs

Algorithm         Average Time (in ms)
Merge Sort        0.34
Quicksort         0.14
Insertion Sort    0.84
Graph 5. Sorting Algorithms’ Average Performance on Arrays of
1000 Random Numbers over 100 Runs
Table 6. Sorting Algorithms’ Average Performance on Arrays of 6000 Random Numbers over 1000 Runs

Algorithm         Average Time (in ms)
Merge Sort        3.55
Quicksort         1.21
Insertion Sort    52.07
Graph 6. Sorting Algorithms’ Average Performance on Arrays of
6000 Random Numbers over 1000 Runs
Once again, the Insertion Sort is dead last. The Quicksort finally had its chance to shine
now that the chances of picking a reasonable pivot value were significantly increased. It actually
outperformed the Merge Sort for the first time. Still, the Merge Sort wasn’t far behind,
performing approximately as well in this test as it did in the others, just as we have come to expect.
Advanced Discussions of Comparison Sort Performance
To review, Insertion Sort, with a running time of Θ(n²), is asymptotically slower in
the worst case but may generally be faster for 10 or fewer numbers (Leiserson & Demaine,
2005a). Mergesort has a Θ(n lg n) running time and is typically the faster algorithm when the
data set exceeds about 30 elements. According to MIT professors Leiserson and Demaine
(2005b), Quicksort can typically sort twice as fast as Mergesort, and its caching and virtual
memory performance is an additional advantage. Quicksort’s performance is affected by the
number of repeated elements; even given distinct elements, its worst-case running time is
O(n²), which occurs when the input is sorted or reverse sorted so that the first or second
partition contains no elements. In the case of a randomized Quicksort, where the pivot is
chosen at random, the danger posed by the order of the data is theoretically mitigated, and the
expected running time becomes O(n log n). Another method is to draw a random sample of the
values and use its median as a best-guess pivot. A more in-depth discussion follows of factors
outside our data sets that can influence the performance of sorting algorithms such as
Quicksort and Mergesort.
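Before that, the randomized-pivot idea mentioned above can be sketched briefly. The names here are our own for illustration: before partitioning, a randomly chosen element is swapped into the pivot position, so a pre-sorted input no longer reliably triggers the quadratic worst case.

```java
import java.util.Random;

// Sketch of randomized pivot selection for Quicksort (illustrative names).
public class RandomPivot {
    static final Random RNG = new Random();

    // Swap a random element of a[lo..hi] into position hi, so the usual
    // last-element partition then pivots on a random value.
    static void choosePivot(int[] a, int lo, int hi) {
        int r = lo + RNG.nextInt(hi - lo + 1);
        int tmp = a[r]; a[r] = a[hi]; a[hi] = tmp;
    }
}
```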
One external factor that can influence sorting performance is parallel processing. The
divide-and-conquer strategies that drive Quicksort and Mergesort are ideal for systems capable
of running multiple processes simultaneously. However, the question of which sort method is
superior in such an environment is not so obvious.
With Mergesort, each iteration generates its data subsets at approximately the same time.
This timing can add overhead for maintaining separate data sets until the desired end stage of
one-element arrays is obtained. Without synchronization, you run the risk of some one-element
arrays experiencing downtime while the rest of the unsorted data “catches up”.
With Quicksort, as soon as a pivot has divided its data subset, processing can immediately
begin on the two resulting data subsets. Therefore, with Quicksort there is no need to maintain
consistent timing between the different processes.
Although the performance gap between O(N lg N) and O(N²) increases as the unsorted
data set grows (as shown in the table below), both of these algorithms pass through stages of
sorting small data sets before completion. With Mergesort’s predictability, it is much easier to
determine when data sets will reach such small sizes. This predictability allows hybrid
Mergesort models to know when to switch temporarily to Insertion Sort and to convert back to
Mergesort for combining larger data subsets. With Quicksort, such a transition is much more
difficult to predict.
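The hybrid idea just described can be sketched as follows. The cutoff of 10 elements is an illustrative choice rather than a measured optimum, and the class name is our own; below the cutoff the recursion falls back to Insertion Sort instead of dividing further.

```java
// Sketch of a hybrid Mergesort that switches to Insertion Sort for small
// subarrays (cutoff is illustrative, not tuned).
public class HybridSort {
    static final int CUTOFF = 10;

    static void sort(int[] a, int lo, int hi) {
        if (hi - lo + 1 <= CUTOFF) { insertionSort(a, lo, hi); return; }
        int mid = (lo + hi) / 2;
        sort(a, lo, mid);          // sort left half
        sort(a, mid + 1, hi);      // sort right half
        merge(a, lo, mid, hi);     // combine the sorted halves
    }

    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int t = a[i], j = i - 1;
            while (j >= lo && a[j] > t) { a[j + 1] = a[j]; j--; }
            a[j + 1] = t;
        }
    }

    static void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = java.util.Arrays.copyOfRange(a, lo, hi + 1);
        int i = 0, j = mid - lo + 1, k = lo;   // indices into tmp's two halves
        while (i <= mid - lo && j <= hi - lo)
            a[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
        while (i <= mid - lo) a[k++] = tmp[i++];
        while (j <= hi - lo) a[k++] = tmp[j++];
    }
}
```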
The table below shows the number of comparisons needed for varying data set sizes. As
we can see, for small data sets the speed advantage shifts back towards the more basic sorting
algorithms.
Table 7. Comparison count between sorting algorithms for various data set sizes.

n (elements in data set)    (N²)/4 comparisons (Insertion Sort)    n log(n) comparisons
5                           7                                      20
10                          25                                     34
16                          64                                     33
100                         2500                                   664
Another factor to consider when choosing between Quicksort and Mergesort involves
data availability prior to swapping. Quicksort’s optimization takes advantage of direct access to
all data (sorted and unsorted). Certain data structures do not provide such direct access, such as
linked lists. The need to “step through” in-between data to implement a swap will therefore have
additional overhead. Mergesort is not as reliant on in-place swapping as Quicksort, which is one
of the reasons Mergesort requires the extra storage space when running.
From the topics very briefly covered in this section, we can see that choosing a sorting
algorithm is not always obvious and may require insight into more than the unsorted data.
Computer hardware as well as algorithm predictability may also need to be taken into account.
Conclusions
After analyzing the results of all of the tests we performed, we came to some important
conclusions regarding the use of these sorting algorithms with different kinds of arrays. First and
foremost, it is clearly important to keep in mind not only the size of the data structure, but also
the arrangement of the data it contains. For example, consider the Insertion Sort algorithm. It
would be an excellent choice for small arrays because it is so simple to implement. Feed it a
large array of pre-sorted data in descending sequence, however, and it quickly grinds to a halt.
The Quicksort algorithm seems to be a better choice for tough jobs than the Insertion Sort, but it
is imperative that a proper pivot value be selected or its performance will quickly deteriorate to
the same abysmal level as the average running time of an Insertion Sort: O(n²).
Merge Sort has its darker side. Its need to create an output array the same size as the array to be
sorted can cause problems if the array is very large because there may not be enough system
memory available.
Another key consideration is the complexity of implementing a given sorting algorithm
with respect to the project at hand. If performance is not a prime factor, choosing a more easily
understood algorithm can potentially mitigate the occurrence of software bugs and make the task
of debugging much simpler. Also, there may not be sufficient programming time to successfully
implement a more complex algorithm. However, when performance is important, a more
complex solution must be embraced despite the cost or the project will not be successful.
As a team, we have learned a great deal about data structures and algorithms during this
course. This final project, in particular, afforded us the opportunity to explore these core
concepts more deeply in order to gain a better understanding of what it takes to be good
programmers. As our results show, the ability to match, successfully, data structures with the
algorithms that will process these efficiently, is an essential skill every programmer should
possess.
References
Drake, P. (2006). Data Structures and Algorithms in Java.
Indianapolis: Sams Publishing.
Lafore, R. (2003). Data Structures and Algorithms in Java (2nd ed.). Upper Saddle River:
Pearson Education, Inc.
Leiserson, C. & Demaine, E. (2005a). Lecture #1: Analysis of Algorithms, Insertion Sort,
Mergesort. Retrieved November 29, 2008, from the Massachusetts Institute of
Technology: MIT, OpenCourseWare, 6.046J / 18.410J Introduction to Algorithms (SMA
5503), Fall 2005 Web site: http://ocw.mit.edu/NR/rdonlyres/Electrical-Engineering-and-
Computer-Science/6-046JFall-2005/67FE51D6-D8AC-4426-A149-
B10A84AAECE6/0/lec1.pdf
Leiserson, C. & Demaine, E. (2005b). Lecture #4: Quicksort, Randomized Algorithms. Retrieved
November 29, 2008, from the Massachusetts Institute of Technology: MIT,
OpenCourseWare, 6.046J / 18.410J Introduction to Algorithms (SMA 5503), Fall 2005
Web site: http://ocw.mit.edu/NR/rdonlyres/Electrical-Engineering-and-Computer-
Science/6-046JFall-2005/9427EAAA-9522-4B26-8FDA-99344B8518AD/0/lec4.pdf
Merge Sort. (2008, December). Retrieved December 14, 2008, from Wikipedia:
http://en.wikipedia.org/wiki/Merge_sort
Sedgewick, R. & Wayne, K. (2007). Mergesort and Quicksort. Retrieved November 29, 2008,
from the Princeton University, Princeton University Computer Science Department,
Computer Science 226: Algorithms and Data Structures, Spring 2007 Web site:
http://www.cs.princeton.edu/courses/archive/spr07/cos226/lectures/04MergeQuick.pdf
Skiena, S. (1997, October 24). Mergesort and Quicksort Lecture 17. Retrieved December 13,
2008, from the State University of New York at Stony Brook Computer Science
Department, CS 214 Data Structures Web site:
http://www.cs.sunysb.edu/~skiena/214/lectures/lect17/lect17.html