qdd

I write what I must because I can. fediverse: @qdd@eruditorum.dev

As a computer science undergrad, one of the topics of study that stands out in my memory is sorting algorithms. This makes some sense, as sorting can be a very costly operation, and it serves as a great MacGuffin for introducing asymptotic analysis.

So there we sat: Merge, Quick, Heap, (we do not speak of Bubble) – we analyzed them all. We coded various implementations. And then, we did something interesting. We actually ran heavy loads through our implementations and compared the results.

We would compare our own implementations of – say – Heap and Merge sort to determine under what circumstances one would outperform the other. We also compared across students – whose Quick was quickest?

This was the biggest learning experience for me: the constants make a difference.

As a refresher, asymptotic analysis (sometimes referred to as “Big O” analysis) drops any multiplicative or additive constants. Two algorithms – let's call them FooSort and BarSort – may actually take nlgn + 10000 and 2nlgn steps respectively. However, both are treated as running in O(nlgn).

There are excellent reasons for this (primarily, that the constants tend to represent implementation or environmental details and so aren't really helpful in measuring the pure, Platonic Ideal of a given algorithm). However, the book that goes into all the details is about 4 inches thick and I'm not feeling that wordy. If you're interested, check out Introduction to Algorithms by Cormen et al.

So this is what I mean about the constants making a difference. Those “implementation details” that the constants represent can be ignored in the strict CS sense. But when you're dealing with real clock cycles, they can make or break your program. Take FooSort and BarSort above. While both are O(nlgn), FooSort is going to be faster than BarSort after a certain break-even point because of those constants. So you, the programmer, have to choose between two options: given your average dataset, which of the algorithms is better? (Here, we're assuming that the Universe of sorting algorithms is {FooSort, BarSort}.)
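Want to see the constants in action? Here's a quick sketch (just an illustration, not anything from the course) that pits a hand-written merge sort against Python's built-in sorted(). Both are O(nlgn), but the built-in's constants – optimized C versus interpreted bytecode – make it dramatically faster. Exact numbers will vary by machine:

    # Both sorts are O(n lg n); the timing gap is pure constants.
    import random
    import timeit

    def merge_sort(xs):
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    data = [random.random() for _ in range(10000)]
    print("merge_sort:", timeit.timeit(lambda: merge_sort(data), number=10))
    print("sorted:    ", timeit.timeit(lambda: sorted(data), number=10))

The same kind of experiment works for weighing your own FooSort and BarSort candidates against a realistic dataset.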

Actually, you have a third option. Why sort at all?

This may sound like CS heresy, but hear me out and answer the question. If your answer is something like, “I need to find the smallest and largest elements of an array,” or, “I need to find all the blue widgets,” or anything where you're doing a simple scan after sorting, you're probably wasting your time.

Let's take finding the minimum value in an array of integers. Using sorting, the best we can do is O(nlgn) for the sort then O(1) for the access of either the first or last element (depending on how you sorted). All told though, this is O(nlgn) + O(1) = O(nlgn).

Now, let's not sort. Finding the minimum of an unsorted array requires looping through each element and just updating our minimum as we go. This can be done in O(n). O(n) < O(nlgn) => sorting is bad QED.
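Here are both approaches as a sketch, in case you want to run the comparison yourself:

    # O(n lg n): sort the whole array just to read one element.
    def min_by_sorting(xs):
        return sorted(xs)[0]

    # O(n): a single pass, tracking the smallest value seen so far.
    def min_by_scanning(xs):
        smallest = xs[0]
        for x in xs[1:]:
            if x < smallest:
                smallest = x
        return smallest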

Okay, okay, joking – but only a little. The real takeaway is this: if you're just trying to get a simple answer from your data and then throwing the data away (or even just getting a small number of answers from a small dataset), then you're probably better off not bothering to sort. Sorting makes sense when you're going to be doing extensive, repeated analysis of – or operations over – a given dataset, and having it in sorted order will speed up those operations. Sorting for the sake of sorting, though, is just a great way of generating heat (or coming up with homework, interview, and test questions).

Exercise for the reader: What are some concrete situations where sorting is essential? When does indexing become a better option?

I recently needed to re-work a custom parser in Python. The original implementation was one giant if-elif-else chain and was a pain to maintain. I managed to break the parser down into smaller chunks and ended up deleting a TON of code. I did this by setting up small classes to handle each different entity I wanted to parse, and pushed a lot of redundant functionality into the base class. However, I really didn't want to maintain a static list of available classes, so I went looking for a way to dynamically discover all subclasses of a particular base.

Long story (... something something metaclasses something something ...) short, this is done using the __subclasses__() class method. Like it says on the box, this returns a list of the known direct subclasses of the given class. The only requirement is that the modules containing your subclasses must have been imported.

Here's an overly-simplified example:


    # pets.py

    class ConfusedAnimal(object):
        @classmethod
        def get_name(cls):
            return cls.__name__

        def make_noise(self):
            # Subclasses are expected to override this.
            raise NotImplementedError()

    class Cat(ConfusedAnimal):
        def make_noise(self):
            return "Oink!"

    class Dog(ConfusedAnimal):
        def make_noise(self):
            return "Hiss!"

    def list_species():
        # Only direct subclasses that have already been imported show up here.
        known_pets = ConfusedAnimal.__subclasses__()

        for pet in known_pets:
            print("{}s go {}".format(pet.get_name(), pet().make_noise()))

Let's run it:


    >>> import pets
    >>> pets.list_species()
    Cats go Oink!
    Dogs go Hiss!

That's all there is to it! You could add a hundred new species and use dynamic imports, but list_species would never have to change.
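In case you're curious what that dynamic importing might look like, here's a minimal sketch. It assumes the subclasses live one-per-module in a package I'm calling species/ – that name and layout are my invention for illustration, not part of the original parser:

    # load_species.py - hypothetical helper; assumes a "species" package
    # containing one module per subclass (cat.py, dog.py, ...).
    import importlib
    import pkgutil

    import species

    def import_all_species():
        # Importing a module executes its class definitions, which is what
        # registers each subclass with ConfusedAnimal.__subclasses__().
        for module_info in pkgutil.iter_modules(species.__path__):
            importlib.import_module("species." + module_info.name)

Drop a new module into the package and it gets picked up on the next run – no registry to maintain.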

Another late night, so this will probably be brief. Trying to write something (anything) every day.

Been reading through the Gospel of Mark lately. I've studied it before, but this time it's more because I'm trying to get back in the habit of reading the Bible and praying daily.

What strikes me this time through the book (at least thus far) is how focused Jesus was on the mission at hand. He chose exactly how to teach, when to perform miracles, and when to retreat based on what would best impact the people around Him.

To the Pharisees and their kind, He was blunt – they had been manipulating God's Law for their own gain and power for years. They studied and knew the Law better than most, and yet they still chose to warp it. So Jesus confronted them head-on – He was there to destroy their claim to authority.

With His disciples, He would take more time to teach. These were people who generally only had the Pharisees' word to go on, so His approach – to a point – was more gentle and gradual. This allowed the people to work through what they were hearing – to actually engage with Him and His word (like we do today).

Anyway, just some observations.

Could you use a URL shortening service like bit.ly or goo.gl to effectively assign a URL to every atom in the known, observable universe?

According to Universe Today we can put an upper bound on the number of stars at $3.0 \times 10^{23}$, with the upper bound on the number of atoms at a staggering $10^{82}$.

While I couldn't find hard numbers, 30 seconds on StackOverflow provides anecdotal evidence that both bit.ly and goo.gl don't directly limit their URL lengths. So we'll assume that they are restricted externally by what browsers can handle.

The maximum “safe” URL length was studied and documented by the researchers at boutell.com to be roughly 2,000 characters in most modern browsers. Removing the 14 characters of “http://bit.ly/” or “http://goo.gl/” gets us down to 1,986 characters to play with.

The last piece of data we need is the size of the alphabet used to encode short URLs. Simple logic (and Wikipedia) dictates that most shortening services use a combination of lowercase, uppercase, and digits for a total alphabet size of 62.

Now we can do maths! Actually, very simple maths too. This just turns into a base conversion from base-10 to base-62 for the number of “things” we're interested in, bounded by 1,986. It turns out that we only actually need $\left\lceil\frac{\log(10^{82})}{\log(62)}\right\rceil = 46$ characters! Adding the obligatory protocol and domain information back in, we finish at 60 characters, well below our maximum URL size. (Actually, we have a TON of room left over, we should find more universe!)
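If you want to double-check the arithmetic, it's a couple of lines of Python (the 14 characters we add back are just len("http://bit.ly/")):

    import math

    ATOMS = 10 ** 82  # upper bound on atoms in the observable universe
    chars = math.ceil(math.log(ATOMS) / math.log(62))
    print(chars)                          # 46
    print(chars + len("http://bit.ly/"))  # 60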

So yes, bit.ly and goo.gl could actually provide a URL for every atom in the known, observable universe. We may continue on, knowing that regardless of how many lolcats pictures are posted, we will always be able to share them on Twitter.

If you like this sort of thing (using math to answer weird questions), then you should check out the XKCD what-if site. It's one of my favorites, and definitely served as inspiration for this post.

Okay, just a short post. I found the docs answering my question about titles (and a bunch of other stuff). Thank goodness it's just Markdown. If I had to learn another blog/wiki/bored-developer's custom markup I'd ... well ... I dunno what I'd do ... but it wouldn't have a positive impact on my day!

Anyhoo ... I'll probably have longer posts (with actual content) in a few days. But it's late now, and I've had a long day. Deuces!

So, I really like this whole “minimalist editor” thing, but how are titles determined? Or have we moved on to some sort of “post-titleism” in blogging now? Did I miss a memo? Was there a manifesto? Something like,

We, the bloggers of the Internet, no longer recognize the tyranny of the “title”. We refuse to accept the idea that our 3,000 word masterpieces of pure intellectual essence can – nay, should! – be boiled down into a “snappy one-liner”. For too long have our thoughts, our ideas, our very souls, been forced to conform to the outmoded norms of a literary society long since proven irrelevant! I say, unite! Unite against those that would see our works watered down to serve aesthetic sensibilities! Unite against the indexers that would see our vision boiled down to n-gram soup! Unite!

#manifesto #downwithtitles #sarcasticpostissarcastic