Shenanigans In CS Homework

As a computer science undergrad, one of the topics of study that stands out in my memory is sorting algorithms. This makes some sense as sorting can be a very costly operation, and it serves as a great MacGuffin to introduce asymptotic analysis.

So there we sat: Merge, Quick, Heap (we do not speak of Bubble) – we analyzed them all. We coded various implementations. And then, we did something interesting. We actually ran heavy loads through our implementations and compared the results.

We would compare our own implementations of – say – Heap and Merge sort to determine under what circumstances one would outperform the other. We also compared across students – whose Quick was quickest?

This was the biggest learning experience for me: the constants make a difference.

As a refresher, asymptotic analysis (sometimes referred to as “Big O” analysis) drops any multiplicative or additive constants. Two algorithms – let's call them FooSort and BarSort – may actually run in n lg n + 10000 and 2n lg n steps respectively. However, both are treated as running in O(n lg n).

There are excellent reasons for this (primarily, that the constants tend to represent implementation or environmental details and so aren't really helpful in measuring the pure, Platonic Ideal of a given algorithm). However, the book that goes into all the details is about 4 inches thick and I'm not feeling that wordy. If you're interested, check out Introduction to Algorithms by Cormen et al.

So this is what I mean about the constants making a difference. Those “implementation details” that the constants represent can be ignored in the strict CS sense. But when you're dealing with real clock cycles, they can make or break your program. Take FooSort and BarSort above. While both are O(n lg n), FooSort is going to be faster than BarSort after a certain break-even point because of those constants. So you, the programmer, have to choose between two options: given your average dataset, which of the algorithms is better? (Here, we're assuming that the Universe of sorting algorithms is {FooSort, BarSort}.)
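To put a number on that break-even point, here's a minimal sketch (in Python) that plugs the made-up FooSort/BarSort cost models – n lg n + 10000 steps versus 2n lg n steps – into a search for the crossover. The names and constants are purely illustrative, not measurements of real implementations.

    import math

    # Hypothetical cost models from the FooSort/BarSort example above.
    def foo_cost(n):
        return n * math.log2(n) + 10000   # n lg n plus a big additive constant

    def bar_cost(n):
        return 2 * n * math.log2(n)       # n lg n with a 2x multiplicative constant

    # Find the smallest n where FooSort's modeled cost dips below BarSort's.
    n = 2
    while foo_cost(n) >= bar_cost(n):
        n += 1
    print("Break-even at roughly n =", n)  # about n = 1000 for these constants

For small inputs, FooSort's 10000-step overhead dominates and BarSort wins; past the crossover, FooSort's smaller coefficient takes over.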

Actually, you have a third option. Why sort at all?

This may sound like CS heresy, but hear me out and answer the question. If your answer is something like, “I need to find the smallest and largest elements of an array,” or, “I need to find all the blue widgets,” or anything where you're doing a simple scan after sorting, you're probably wasting your time.

Let's take finding the minimum value in an array of integers. Using sorting, the best we can do is O(n lg n) for the sort, then O(1) for the access of either the first or last element (depending on how you sorted). All told though, this is O(n lg n) + O(1) = O(n lg n).
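In code, the sort-first version looks something like this (a Python sketch with a made-up example array):

    data = [42, 7, 19, 3, 88, 55]   # example input

    # Sort first: O(n lg n) for the sort, O(1) to grab the first element.
    smallest = sorted(data)[0]
    print(smallest)   # 3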

Now, let's not sort. Finding the minimum of an unsorted array requires looping through each element and just updating our minimum as we go. This can be done in O(n). O(n) < O(n lg n) => sorting is bad, QED.
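The no-sort version is just a running minimum – a single O(n) pass, sketched here in Python (the built-in min() does the same scan for you):

    data = [42, 7, 19, 3, 88, 55]   # same example input

    # One pass: keep track of the smallest value seen so far.
    minimum = data[0]
    for value in data[1:]:
        if value < minimum:
            minimum = value
    print(minimum)   # 3 – same answer, no sort required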

Okay, okay, joking – but only a little. The real takeaway is this: if you're just trying to get a simple answer from your data and then throwing the data away (or even just getting a small number of answers from a small dataset), you're probably better off not bothering to sort. Sorting makes sense when you're going to be doing extensive, repeated analysis of – or operations over – a given dataset and having it in sorted order will speed up those operations. Sorting for the sake of sorting, though, is just a great way of generating heat (or coming up with homework, interview, and test questions).

Exercise for the reader: What are some concrete situations where sorting is essential? When does indexing become a better option?