Thursday, April 2, 2009

Back to the Basics

Over the past several weeks of droning through Jira issues and implementing some custom extension points for IBM WebSphere Integration Developer, it has become apparent that I have drifted away from some of the fundamentals that make up Computer Science 101. Over the next few months I'll attempt to blend a review of some of these core concepts with some actual 'contemporary' content.

The first thing I'm going to cover is some basic sorting algorithms. Sorting is one of those things that gets covered up nicely in Java development with methods like Collections.sort(...). All the gory details remain hidden and you just get back your list sorted according to its elements' natural ordering.
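For instance, the simplest case is a one-liner (a quick illustrative snippet, not part of the original source download):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NaturalOrderExample {
    public static void main(String[] args) {
        List<String> names = new ArrayList<String>(Arrays.asList("carol", "alice", "bob"));
        Collections.sort(names);   // natural ordering via Comparable (String here)
        System.out.println(names); // [alice, bob, carol]
    }
}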

Some things to consider when choosing a sorting algorithm:

1.  Computational Complexity
Big-O notation.  Most general-purpose sorts run in O(n log n) in the average case.  There are a few O(n^2) sorts, but those should probably only be considered for small data sets or academic reasons.

2.  Stability
The term stability as it applies to sorting refers to the final ordering of elements that compare as equal. If their original relative order is preserved, the sort is said to be stable. For example, the following x, y points are going to be sorted based on their x value:

Original:
(8, 3) (2, 5) (3, 7) (8, 9) (4, 4)

Sorted (Stable):
(2, 5) (3, 7) (4, 4) (8, 3) (8, 9)

Sorted (Unstable):
(2, 5) (3, 7) (4, 4) (8, 9) (8, 3)

The list is sorted either way, but there are two different, equally correct answers! (A quick Java illustration of this appears after this list.)

3.  Memory Usage
This of course depends on the implementation of your sort.  Often it is easier to comprehend and read the code when results are stored in separate structures and returned.  It is, however, possible to implement most of the major sorts using only the original list plus minimal external structures.  These types of sorts are called "in-place".  The three sorts presented in code are all implemented as in-place sorts.
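Coming back to stability for a second, here is a quick Java illustration of the point example above (the Point class and comparator are made up for this post, but Collections.sort is documented to be a stable merge sort):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class StabilityExample {
    // Simple point holder used only for this illustration.
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        public String toString() { return "(" + x + ", " + y + ")"; }
    }

    public static void main(String[] args) {
        List<Point> points = new ArrayList<Point>(Arrays.asList(
                new Point(8, 3), new Point(2, 5), new Point(3, 7),
                new Point(8, 9), new Point(4, 4)));

        // Sort by x only; because the sort is stable, (8, 3) stays ahead of (8, 9).
        Collections.sort(points, new Comparator<Point>() {
            public int compare(Point a, Point b) { return a.x - b.x; }
        });

        System.out.println(points); // [(2, 5), (3, 7), (4, 4), (8, 3), (8, 9)]
    }
}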

The Sorts:

Bubble Sort
The bubble sort is a very simple sort and is probably only valuable for academic purposes or very small data sets. In a bubble sort, the list to sort is repeatedly iterated over. Consecutive items are compared and swapped if out of order. When no swaps are made on an iteration, the sort is complete, so it is very fast on an already sorted list.

Something to note is that the largest unsorted element gets swapped into its final position on each pass. A slight performance improvement is to decrement the end index of the main loop after each pass, since the last item is always sorted. The bubble sort is stable, but on average runs in O(n^2) time: if there are 10 items in the list, on the order of 100 comparisons are made to sort it.
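A minimal in-place version might look something like this (a sketch for illustration, not necessarily the exact code from the source download):

public class BubbleSort {
    // Repeatedly sweeps the array, swapping adjacent out-of-order elements,
    // and stops as soon as a full pass makes no swaps.
    public static void sort(int[] a) {
        int end = a.length - 1;
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 0; i < end; i++) {
                if (a[i] > a[i + 1]) {
                    int tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                    swapped = true;
                }
            }
            end--; // the largest unsorted element has bubbled into place
        }
    }
}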

Merge Sort
The merge sort is another stable sort (in most implementations) and on average runs in O(n log n) time. It works by taking the input list, splitting it in half, recursively applying the merge sort to each sublist, and merging the results back together.  The merge step simply walks down the two sorted sublists (a, b) and 'zips' them together in ascending order into a single list of length a.length + b.length.  There are in-place and out-of-place versions of this sort, as well as recursive and iterative ones.  The example provided is in-place and recursive.
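The following is a sketch of a recursive merge sort that sorts the original array (it does use a temporary buffer during each merge step; the version in the actual source download may differ):

import java.util.Arrays;

public class MergeSort {
    public static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    // Recursively sorts a[lo..hi].
    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;               // zero or one element: already sorted
        }
        int mid = lo + (hi - lo) / 2;
        sort(a, lo, mid);         // sort the left half
        sort(a, mid + 1, hi);     // sort the right half
        merge(a, lo, mid, hi);    // zip the two sorted halves back together
    }

    // Merges the sorted runs a[lo..mid] and a[mid+1..hi] back into a[lo..hi].
    private static void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = Arrays.copyOfRange(a, lo, hi + 1);
        int i = 0;                // index into the left run within tmp
        int j = mid - lo + 1;     // index into the right run within tmp
        for (int k = lo; k <= hi; k++) {
            if (i > mid - lo) {
                a[k] = tmp[j++];          // left run exhausted
            } else if (j > hi - lo) {
                a[k] = tmp[i++];          // right run exhausted
            } else if (tmp[i] <= tmp[j]) {
                a[k] = tmp[i++];          // <= keeps equal elements in order (stable)
            } else {
                a[k] = tmp[j++];
            }
        }
    }
}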

Quick Sort
The quick sort is another popular sort that runs in O(n log n) time on average and O(n^2) in the worst case.  It is not a stable sort.  The quick sort works by choosing a value called the pivot.  The list is then partitioned into two sublists: values <= pivot and values > pivot.  The quick sort is then applied recursively to the two sublists, with the final result being [LT] + [PIVOT] + [GT].  Selection of the pivot value can affect the timing results.  In general, a pivot near the median value of all the elements in the input list is desirable, since it splits the [LT] and [GT] lists evenly.  The tradeoff is that determining an optimal pivot value costs more computational cycles.  In the source provided, a random pivot value is selected from the input list.
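An in-place quick sort with a random pivot could look like the following sketch (again, for illustration; not necessarily identical to the code in the source download):

import java.util.Random;

public class QuickSort {
    private static final Random RAND = new Random();

    public static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    // Recursively sorts a[lo..hi] in place around a randomly chosen pivot.
    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        // Pick a random pivot and park it at the end of the range.
        int pivotIndex = lo + RAND.nextInt(hi - lo + 1);
        swap(a, pivotIndex, hi);
        int pivot = a[hi];

        // Partition: everything <= pivot ends up to the left of 'store'.
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] <= pivot) {
                swap(a, i, store++);
            }
        }
        swap(a, store, hi);       // drop the pivot into its final position

        sort(a, lo, store - 1);   // values <= pivot
        sort(a, store + 1, hi);   // values > pivot
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}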
 
Results:

Three different data sets of 10000 items were used during the testing: a reverse-ordered list, an ordered list, and a list of random data (note: the same random data is used for all tests).  Each sort was run on each list 100 times.  The following results were obtained:

Average Bubble Sort on Reverse Ordered List with 10000 items: 6372 ms.
Average Bubble Sort on Ordered List with 10000 items: 0 ms.
Average Bubble Sort on Random List with 10000 items: 5870 ms.

Average Merge Sort on Reverse Ordered List with 10000 items: 1427 ms.
Average Merge Sort on Ordered List with 10000 items: 2 ms.
Average Merge Sort on Random List with 10000 items: 725 ms.

Average Quick Sort on Reverse Ordered List with 10000 items: 9 ms.
Average Quick Sort on Ordered List with 10000 items: 8 ms.
Average Quick Sort on Random List with 10000 items: 10 ms.
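For the curious, the timing harness can be as simple as the sketch below (the class names refer to the sketches above, not to the original benchmark code, and the data setup shown covers only the random case):

import java.util.Random;

public class SortTimer {
    public static void main(String[] args) {
        // Build one fixed set of random data so every run sees the same input.
        int[] data = new int[10000];
        Random rand = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = rand.nextInt();
        }

        int runs = 100;
        long start = System.currentTimeMillis();
        for (int r = 0; r < runs; r++) {
            int[] copy = data.clone();  // sort a fresh copy each run
            QuickSort.sort(copy);
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Average Quick Sort on Random List with "
                + data.length + " items: " + (elapsed / runs) + " ms.");
    }
}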

This is where this entry ends.  A deeper analysis of the results would be prudent at some point, but it should be obvious that quick sort just wins here, even with non-optimized pivot selection.  I tried to make a handy graph for this, but effectively comparing 10 ms to 6372 ms proved a bit more than my Google Chart API skills could handle.
