http://qs321.pair.com?node_id=9380


in reply to Schwartzian Transform

I thought I would add to this snippet since I had to stare at the code for far to long before I figured out how I might use it.

Okay so we know it is for sorting, but why is it useful? It is a heck of a lot faster for sorting large data sets where the data has to be manipulated before sorting. The example I came up with is this:

We have a huge list of names, and we want to sort them alphabetically by last name, first name, then middle name. That doesnt sound too bad. But the problem is that our database does not store last names, first names or middle names seperately. Instead there is just a single name field like "Arthur C. Clarke". So when we get all the data from the database first we have to put the string in alphapbetical sort order (ie make "Arthur C. Clarke" into "Clarke Arthur C.") then sort it.

Someone without Randal Schwartz's insight (namely me) would have done something like this:
my @sorted = sort mysort @unsorted; sub mysort { $a =~ m/(.*?)\s*(\S+)$/; $aa = "$2 $1"; $b =~ m/(.*?)\s*(\S+)$/; $bb = "$2 $1"; return $aa cmp $bb; }
Now this would be bad for large data sets because each element in @unsorted will get sent to mysort several time, so we are then performing the match several times where we dont really need to.

So in order to only perform the match once we go through the array and replaces each element with an array reference containing the original element (dont loose the original!) and the new sortable string. So the first step:
@unsorted = ("Larry Wall", "Arthur C. Clarke"); @a = map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted; # # now @a is (["Larry Wall", "Wall Larry"], # ["Arthur C. Clarke", "Clarke Arthur C."]) #
Now we do a simple sort on the second element in @a: @b = sort {$a->[1] cmp $b->[1]} @a; And finally we turn the 2D array back into a sorted 1D array restoring the original values: @sorted = map {$_->[0]} @b; If you cascade the function calls, then you remove the intermediate steps of my @a, and @b and you get the Schwartzian Transform:
my @sorted = map{ $_->[0] } sort {$a->[1] cmp $b->[1]} map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
Now that is pretty cool stuff, but how much of a time savings is this extra work? Well, I did a small benchmarking test to mimic what a time savings would under a large data set:
#!/usr/bin/perl use Benchmark; @unsorted = ("Larry Wall", "Jane Sally Doe", "John Doe", "Morphius", "Jane Alice Doe", "Arthur C. Clarke"); timethese(100000, { "schwartzian" => sub { my @sorted = map{ $_->[0] } sort {$a->[1] cmp $b->[1]} map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" +] } @unsorted }, "sort routine" => sub { my @sorted = sort mysort @unsorted } }); sub mysort { $a =~ m/(.*?)\s*(\S+)$/; $aa = "$2 $1"; $b =~ m/(.*?)\s*(\S+)$/; $bb = "$2 $1"; return $aa cmp $bb; }
And the results:
Benchmark: timing 100000 iterations of schwartzian, sort routine... schwartzian: 57 wallclock secs (53.66 usr + 0.30 sys = 53.96 CPU) sort routine: 124 wallclock secs (122.48 usr + 0.41 sys = 122.89 CPU)

I hope this helps.

Replies are listed 'Best First'.
RE: RE: Schwartzian Transform
by questor (Acolyte) on May 11, 2000 at 23:18 UTC
    Not a bad dissection, but it probably should be pointed out that Tom Christiansen covered it in one of his FMTEYEWTK (Far More Than Everything You Ever Wanted To Know) papers, which you can find under CPAN's documentation section, and it's covered in the Perl Cookbook (Christiansen and Nathan Torkington) in section 4.15.
      In fact, the Usenet message that became the "FMTYEWTK about Sorting" was the message in which the Schwartzian Transform got its name!

      In other words, it was named after me, but not by me. I'd never do anything as vain as that. Not that I'd admit to, anyway. :)

        I believe the Usenet message in question is this one. At any rate, it's the earliest matching message Google Groups has on record.

        Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
Re: RE: Schwartzian Transform
by Vynce (Friar) on May 23, 2001 at 01:05 UTC
    i hate to get picky, but your Schwartzian transform doesn't look like merlyn's to me.


    perlmonkey's:
    my @sorted = map{ $_->[0] } sort {$a->[1] cmp $b->[1]} map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
    merlyn's:
    my @output = map { $_->[1] } sort { $a->[0] cmp $b->[1] } map { [$_, expensive_func($_)] } @input;
    is this because the ST is more general than i thought, or is there a typo in merlyn's ST writeup? i'm not sure i'm comfortable with any sort that compares $a->[0] to $b->[1]... is there ever a good reason to do that?

    update: looks like merlyn's changed his to compare index 1 of both arrays. which is almost a shame; i was hoping there was some really cool reason to compare different indices. but anyway, i withdraw the question.

Re^2: Schwartzian Transform
by DBAugie (Beadle) on Mar 17, 2008 at 18:48 UTC
    This helps.

    This node may or may not have said the same thing ten other people have said. But significantly, this time I understood!
    Thanks