http://qs321.pair.com?node_id=447633

A couple of weeks ago I got somewhat annoyed when several people recommended a Schwartzian Transform for an extremely trivial sorting operation involving at most a few dozen items. The Schwartzian is really cool, and it is a technique that every Monk should know and understand. But there are times when it's simply a waste. The solution I posted was simple, straightforward, and inefficient. Also, unlike some of the untested Schwartzian examples posted, it worked. It was:

@files = sort {
    my ( $ad ) = ( $a =~ /fwlog\.(\d+)\w+/ );
    my ( $bd ) = ( $b =~ /fwlog\.(\d+)\w+/ );
    $ad <=> $bd
} @files;

This code is not fast, but it's easy to write, understand, and debug. I call this property WUD. If your code has a lot of WUD you probably finished the job sooner than expected and made life easier for the next person who has to work on it. Unfortunately, WUD is usually inversely proportional to efficiency. Optimization, therefore, is the art of determining whether speed or WUD is more important in a given situation.

Before we continue, let's remind ourselves of one fact: We are talking about Perl. Perl is slow. It eats up gobs of memory and has painful overhead for sometimes annoyingly trivial operations. But Perl has tons of WUD. It's loaded with WUD. Fifteen minutes hacking out a one-off Perl program can accomplish tasks that would take a C programmer months and a shell scripter days. But even with all that overhead, that naive sorting algorithm, which must stupidly re-do the regex repeatedly on the same data, takes less than a quarter of a second to sort a list of one hundred strings on my computer. Is the WUD worth it? I think so.

Here I will look at four of the most common sorting idioms used in Perl on lists of various sizes. We have a list of strings in the form /foo\d{1,3}\.tla/ and we want to sort them numerically by those digits in the middle. Perl's default lexicographic sorter won't work, because the digits are not zero-padded on the left, and the list will come out looking like:

foo1.tla foo10.tla foo100.tla foo101.tla foo102.tla etc.
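You can see the problem with a quick test (a few made-up names, not the benchmark data):

my @files = qw( foo2.tla foo10.tla foo1.tla foo100.tla );
print join ' ', sort @files;
# prints: foo1.tla foo10.tla foo100.tla foo2.tla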

For the benchmarks below I will use a list of 250 randomly generated strings of this type. Let's take a look at our four idioms:

Here is the naive sort, a variation of which we saw above:

sub naive {
    sort {
        my ( $ma ) = $a =~ /foo(\d+)\.tla/;
        my ( $mb ) = $b =~ /foo(\d+)\.tla/;
        $ma <=> $mb
    } @_;
}

Here is the Orcish Maneuver, which is the same as the naive sort, except we cache the results of the regexes so we only have to do them once per item. (Orcish is a pun on "or-cache.")

sub orcish {
    my %c;
    sort {
        $c{$a} or ( $c{$a} ) = $a =~ /foo(\d+)\.tla/;
        $c{$b} or ( $c{$b} ) = $b =~ /foo(\d+)\.tla/;
        $c{$a} <=> $c{$b};
    } @_;
}

Here is the ubiquitous Schwartzian Transform, which constructs a list of array references containing the original datum and the chunk returned by the regex, sorts on the latter chunk, and then maps out the original datum from the resulting list.

sub schwartzian {
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [ $_, /foo(\d+)\.tla/ ] }
    @_;
}

Finally, here is the Guttman-Rosler Transform, which is similar to the Schwartzian, except that instead of array references it constructs strings with the numbers zero-padded on the left. The strings are then sorted using Perl's built-in default lexicographic sorter, which is fast C code and does not rely on a Perl callback.

sub guttros {
    map  { substr( $_, 3 ) }
    sort
    map  { sprintf( "%03d", /foo(\d+)\.tla/ ) . $_ }
    @_;
}
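To see why the zero-padding works, here is what the packed strings look like for a few made-up entries. The three-character prefix makes plain string comparison agree with numeric order, and substr( $_, 3 ) strips it back off afterward. (Note that this relies on the numbers never exceeding three digits, per the /foo\d{1,3}\.tla/ spec; a four-digit number would break both the ordering and the substr.)

# 'foo7.tla'   becomes '007foo7.tla'
# 'foo40.tla'  becomes '040foo40.tla'
# 'foo100.tla' becomes '100foo100.tla'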

Here is a benchmark showing the speed of these algorithms on a list of 250 randomly generated strings.

NOTE: The following benchmark results are wrong. See the Update at the bottom.

Benchmark: timing 5000 iterations of guttros, naive, orcish, schwartzian...
    guttros: 14 wallclock secs (14.10 usr +  0.00 sys = 14.10 CPU) @    354.61/s (n=5000)
      naive:  0 wallclock secs ( 0.04 usr +  0.00 sys =  0.04 CPU) @ 125000.00/s (n=5000)
     orcish:  1 wallclock secs ( 0.04 usr +  0.00 sys =  0.04 CPU) @ 125000.00/s (n=5000)
schwartzian: 26 wallclock secs (25.70 usr +  0.03 sys = 25.73 CPU) @    194.33/s (n=5000)

On our relatively tiny data set, the two simplest sorts perform magnificently compared to the ST and GRT. In fact, even at five thousand iterations, naive and orcish emitted warnings that the benchmark would not be accurate. Further testing reveals that the overhead of the caching operation provides no speed benefit over the naive sort for this small data set. Keep in mind that this list is ten times as big as the one in the original thread that inspired this screed.

So how big must our data sets be in order for the extra overhead of the ST and GRT to be worth the reduction in WUD? A benchmark with one thousand strings shows that naive and orcish are nearly unchanged, while the execution time for both the ST and GRT more than quadrupled. (However, the GRT is nearly twice as fast as the ST.)

Further increases in data size would be an exercise in futility. What we've discovered is that performing this simple regex is actually less costly than the significant overhead in setting up the ST and GRT.

The lesson here is that these fancy sorting algorithms are only useful when the operation to construct the sort-key (in this case, extracting the numbers with a regex) is more expensive than the overhead of the fanciness.
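For example, suppose the key really is expensive, say a stat of each file to sort by modification time. This sketch is not from the original thread; -M returns a file's age in days and hits the disk each time it runs:

# Naive: -M runs twice per comparison, O(n log n) times in total.
my @newest_first = sort { -M $a <=> -M $b } @files;

# Schwartzian: -M runs exactly once per file.
my @newest_first_st = map  { $_->[0] }
                      sort { $a->[1] <=> $b->[1] }
                      map  { [ $_, -M $_ ] } @files;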

So how do we decide? Run benchmarks on everything? Not even necessary. Just ask yourself the following questions:

  1. How big is my dataset likely to be? Unless it's huge, a naive sort is probably the best sort.
  2. How often must this sort run? If it's a once-a-week cronjob, it probably doesn't matter if it takes ten seconds instead of one second. The extra nine seconds are a small price to pay for lots of WUD. On the other hand, if it has to run once every minute, then those nine seconds are important.
  3. Is my dataset likely to grow? If it is, maybe it's worth doing a fancier algorithm now. On the other hand, maybe it's worth keeping the simple one until the data gets big enough to require a fancy one.

In short, a bit of forethought and common sense can save you a lot of trouble down the road.

Happy sorting.
- friedo

Update: Thanks to Pustular and ikegami for pointing out the Benchmark mistake. (I had forgotten that sort is a no-op in void context.) After re-testing I've found that the Schwartzian becomes more efficient than the naive sort for this operation on lists bigger than 200 items, and the GRT at 120, on my machine. So I was clearly wrong about the extent of the ST and GRT overhead. My point about eliminating complexity for trivial cases still stands, though. :)
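For the curious, the broken benchmark boiled down to something like this (a reconstruction, not the original code, assuming @files holds the list from above):

use Benchmark 'timethese';
timethese( 5000, {
    broken => sub { naive( @files ) },         # void context: the sort is optimized away
    fixed  => sub { my @s = naive( @files ) }, # list context: the sort actually runs
} );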

Replies are listed 'Best First'.
Re: Choosing the right sort (Or: WUDsamatter?)
by tlm (Prior) on Apr 14, 2005 at 04:36 UTC

    Following up on a sharp observation of tye's in CB, I decided to check whether some of your sorts were in fact no-ops. Here's what I benchmarked, and the results:

    srand 0;
    my @fixture = map 'foo' . int( rand 1000 ) . '.tla', 1 .. 5000;

    use Benchmark 'cmpthese';
    cmpthese( -1, {
        naive       => sub { ( naive( @fixture )       )[ 0 ] },
        orcish      => sub { ( orcish( @fixture )      )[ 0 ] },
        schwartzian => sub { ( schwartzian( @fixture ) )[ 0 ] },
        guttros     => sub { ( guttros( @fixture )     )[ 0 ] },
    } );
    __END__
                  Rate naive orcish schwartzian guttros
    naive       3.05/s    --   -65%        -76%    -87%
    orcish      8.74/s  186%     --        -31%    -61%
    schwartzian 12.7/s  317%    46%          --    -44%
    guttros     22.6/s  642%   159%         78%      --
    Note that the subs that are benchmarked return the first element of the sorted array.

    the lowliest monk

      Just to clarify Pustular Postulant's findings: the OP must have called the functions in void or scalar context. sort is optimized to do nothing in that situation, so the functions naive and orcish return instantly. Read more for confirmation of Pustular Postulant's results, and confirmation that running the sorts in scalar context gives results similar to those of the OP.

      That's why it's important to show one's work :)

        Yes, that's what gave me the idea of asking for the first element of the sorted array, though in hindsight there are definitely clearer, more direct ways to force a list context than the one I chose.

        the lowliest monk

Re: Choosing the right sort (Or: WUDsamatter?)
by brian_d_foy (Abbot) on Apr 14, 2005 at 09:53 UTC
      This code is not fast, but it's easy to write, understand, and debug. I call this property WUD.
      HTH :).
      From the OP:
      ... it's easy to *write*, *understand*, and *debug*. I call this property WUD.
      (emphasis added)
Re: Choosing the right sort (Or: WUDsamatter?)
by dragonchild (Archbishop) on Apr 14, 2005 at 12:48 UTC
    WUD is a good term. I like it. It has panache and elan. It doesn't pussyfoot around, getting bogged down in PC crap. It walks in, says its piece, and leaves.

    And it can beat up "Premature optimization is ..." any day of the week, with one hand tied behind its back in bed with the measles. And twice on Sunday. :-)

Re: Choosing the right sort (Or: WUDsamatter?)
by hardburn (Abbot) on Apr 14, 2005 at 13:28 UTC

    Even though almost every sort routine I've ever written in Perl is an ST, I can't think of a single instance where it was done for speed. In fact, once the size of the list is significant, the ST starts eating gobs of memory. So an Out of Memory condition is likely exactly where the ST is supposed to shine.

    I wrote all this code as an ST to keep separate things separate. The sort routine should do sorting, not matching regexen, not calling databases, not arithmetic. Sorting. Preferably, it wouldn't even do an array dereference and lookup, but then I would need a GRT, and the GRT requires the use of a string as an information-hiding mechanism, which I also don't like.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: Choosing the right sort (Or: WUDsamatter?)
by bart (Canon) on Apr 19, 2005 at 08:14 UTC
    I don't get it. Your benchmark is completely bogus; you admit it in an update, but you only put that in a little remark at the bottom. So you're actually still pretending that everything is just fine.

    You should fix the benchmark, for example by assigning the sort result to a lexical array in the subs, rerun it, and strike through the flawed results.

    But no, you prefer to pretend everything is fine and dandy. You get a downvote from me. What a waste of time.

    Update As per my request, friedo updated his node a little, pointing out that the benchmark is flawed right where it appears. At least now people are forewarned. Me happy.

WUD: What U Did.
by tphyahoo (Vicar) on Apr 18, 2005 at 09:23 UTC
    I like it too.

    Alternative acronym for WUD: What U Did.

    As in, reconstructing What U Did six months later without having to pull out most of your hair.

Re: Choosing the right sort (Or: WUDsamatter?)
by salva (Canon) on Apr 18, 2005 at 09:36 UTC
    you should try Sort::Key:
    use Sort::Key;
    @sorted = ikeysort { /foo(\d+)\.tla/; $1 } @data;
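    (For anyone trying this: ikeysort computes the key once per element and sorts by it as an integer, with the comparisons done in C rather than in a Perl callback. Depending on your version of Sort::Key you may need to import it explicitly, e.g. use Sort::Key qw(ikeysort);.)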
      I have added an entry for Sort::Key to Pustular Postulant's benchmarks
      use Sort::Key;
      sub mykeysort { ikeysort { /foo(\d+)\.tla/; $1 } @_ }
      and those are the results for 200 elements:
                    Rate naive schwartzian guttros Sort::Key
      naive       17.9/s    --        -73%    -79%      -83%
      schwartzian 67.1/s  276%          --    -20%      -36%
      guttros     84.0/s  371%         25%      --      -19%
      Sort::Key    104/s  483%         55%     24%        --

      for 5000 elements:

                     Rate naive schwartzian guttros Sort::Key
      naive       0.413/s    --        -77%    -86%      -89%
      schwartzian  1.79/s  334%          --    -38%      -52%
      guttros      2.88/s  597%         61%      --      -23%
      Sort::Key    3.74/s  805%        109%     30%        --

      for 50000 elements:

                  s/iter naive schwartzian guttros Sort::Key
      naive         30.2    --        -78%    -87%      -90%
      schwartzian   6.53  363%          --    -42%      -56%
      guttros       3.81  693%         72%      --      -24%
      Sort::Key     2.90  942%        125%     31%        --

      and for 500000 elements:

                  s/iter guttros Sort::Key
      guttros       39.6      --      -25%
      Sort::Key     29.7     33%        --

      so Sort::Key is consistently ~30% faster than guttros ;-)