Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

RFC:A brief tutorial on Perl's native sorting facilities.

by BrowserUk (Patriarch)
on Feb 05, 2007 at 19:55 UTC ( [id://598394]=perlmeditation: print w/replies, xml ) Need Help??

It has been suggested to me that Re^3: sort array by date might make a useful addition to the tutorials section of PM?

Below is a cleaned up and somewhat proof-read version of that post for your consideration. Suggestions as to which category of the Tutorials section it belongs are are invited.


A brief tutorial on Perl's native sorting facilities.

Simple sorting using sort

The basic method of sorting in Perl is to use the built in sort function.

  1. my @sortedAlpha = sort @alphaData;

    This does pretty much exactly what you'd expect. Takes an array(*) of data, orders it ascending (smallest first) according the ordinal (ASCII**) values of the characters in the strings:

    print for sort qw[ 12 a123 a122 A123 B123 Ab123 aB123 456 1A23 1a23 ] +;; 12 ## ord('1') == 49, ord('2') == 50 1A23 ## ord('A') == 65 so this comes after. 1a23 ## ord('a') == 97 " 456 ## ord('4') == 52 " A123 ## ord('A') == 65, ord('1') == 49 " Ab123 ## ord('b') == 98 " B123 ## etc. a122 a123 aB123

    *actually a list, but ignore that for now.

    **if your data is unicode, then the unicode ordinal values are used.

  2. my @reversedSordedAlpha = sort{ $b cmp $a } @alphaData;

    Here we want the data ordered in descending (largest first) sequence, so we have(*) to use code block to alter the default ordering.

    The { $b cmp $a } part of that tells sort that we wish to perform the ordering ourselves instead of using the default as we did in 1 above (which is equivalent to { $a cmp $b }). This says that for each pair of values in the input, Perl passes them to us (as $a and $b), and we perform a comparison and return

    • -1 if $a is less than $b
    • 1 if $a is greater than $b
    • 0 if $a equals $b

    With this information, sort can correctly do what it needs to do to reorder the input appropriately.

    By using the built-in cmp operator that is specifically designed for comparing strings and producing these values, we achieve the result we are after.

    Supplying the arguments $a and $b to the function in reverse order has the effect of reversing the sort.

    However, if we use cmp on numeric data, we do not get the effect we want:

    print for sort{ $b cmp $a } qw[ 1 10 100 2 20 21 3 300 ];; 300 3 21 20 2 100 10 1

    For numeric data we need to use the other built-in comparison operator <=>.

    *we can also use reverse, and that may actually be quicker, but this is a tutorial and this way demonstrates the point I wish to make.

  3. my @reverseSortedNumbers = sort { $b <=> $a } @numbers;

    Using <=> produces the desired result.

    print for sort{ $b <=> $a } qw[ 1 10 100 2 20 21 3 300 ]; 300 100 21 20 10 3 2 1

    The difference is that <=> compares the (binary) numerical value of its arguments, instead of the bytewise string comparisons of the ASCII representation that cmp does.

  4. my @sortedNumbers = sort{ $a <=> $b } @numbers;

    And for an ascending numerical sort, we just pass the parameters to <=> in the 'right' order.

How sort actually works internally doesn't really matter. What does matter is that being built-in to the language, it understands Perl's internals, is always available and about as fast as you are likely to achieve.

So, sorting data is easy, when you want to sort according to the entire value of each element of the array or list. It gets a little more complex when you want to sort the elements by a sub field of each element.

For example, say you have a set of ids (A123 B421 C987 etc.) and you want to order them by just the numeric part, then you need to extract that part before supplying it to the appropriate comparison function:

print for sort{ substr( $a, 1 ) <=> substr( $b, 1 ) } qw[ A473 B659 C123 D222 E001 ];; E001 C123 D222 A473 B659

This works fine, but each element will be passed to the comparison block multiple times, and that means we are extracting the sort field from each element multiple times, which can slow things down. For more complex cases, (eg. data consisting of dates), this could be significant. The purposes of the "advanced sorting" mechanisms is to reduce this slowdown by performing the extraction of the significant data from each input element once only.

The ST

One mechanism, and possibly the simplest to understand, is merlyn's ST (Schwartzian Transform). This works by pre-processing the elements to extract the fields, and packages them into anonymous arrays. The list of anonymous arrays are then sorted according to the extracted field. And finally, the original elements are recovered from the list of sorted anonymous arrays and stored into the sorted array.

## Build an array of anonymous arrays, ## each of which contains the sort field and the original element. @anons = map{ [ substr( $_, 1 ) , $_ ] } qw[ A473 B659 C123 D222 E001 +];; print Dumper \@anons;; $VAR1 = [ ['473','A473'],['659','B659'],['123','C123'], ['222','D222'],['001','E001'] ]; ## Now sort the anonymous arrays ## by comparing the extracted fields. @sortedAnons = sort{ $a->[ 0 ] <=> $b->[ 0 ] } @anons;; print Dumper \@sortedAnons;; $VAR1 = [ ['001','E001'],[123,'C123'],[222,'D222'], [473,'A473'],[659,'B659'] ]; ## Finally, build the required sorted array ## by extracting the original elements discarding the sort fields. @sorted = map{ $_->[ 1 ] } @sortedAnons;; print Dumper \@sorted;; $VAR1 = ['E001','C123','D222','A473','B659'];

This is quite simple to follow and can be coded in a more compact form, by combining the 3 steps into a single statement and doing away with the intermediate arrays:

@sorted = map{ $_->[1] } sort{ $a->[0] <=> $b->[0] } map{ [ substr( $_, 1 ), $_ ] } qw[ A473 B659 C123 D222 E001 ];; print Dumper \@sorted;; $VAR1 = ['E001','C123','D222','A473','B659'];

This is identical to the previous example, but more compact. It can also be formatted as a single line, though that will usually mean that it extends beyond the boundary at which PM does its dumb wrapping. It can also be laid out in several different ways to that I've shown, but I find that this makes the three significant steps of the algorithm clearly delineated, by indenting the contents of the code blocks just as you would for any other block of code. Eg. The body of if and while statements etc. I like this consistency.

The efficiency of the ST comes from only performing the extractions of the field data once. Indexing the anonymous arrays is very efficient and so the sorting and reassembly doesn't costs much extra. It also extends to sorting multiple fields in an obvious way:

print for map{ $_->[2] } sort{ $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } map{ [ substr( $_, 1 ), substr( $_, 0, 1 ), $_ ] } qw[ A473 B437 B659 C659 C123 D123 D222 E222 E001 A001 ];; A001 E001 C123 D123 D222 E222 B437 A473 B659 C659

Here we've ordered the data according to the numeric field as before, but this time we've used the alpha field to 'tie-break' values when the numeric values are the same. It's easy to see how this can be further extended to handle as many fields of any type we need. This simple, obvious extension mechanism is the great strength of the ST.

The caveats

The downside of the ST is that for large datasets, the creation and destruction of all the small anonymous arrays can itself prove to be fairly expensive of time and memory.

It's also the case that calling back into Perl code via the sort comparison block imposes a significant time cost. (Don't worry about this for the simple cases of sort{ $a <=> } ... and sort{ $b cmp $a } shown above, as Perl has optimisations to recognise these simple cases, and it doesn't actually do a callback at all).

But for more complex sorting, these callbacks are expensive, and if you can avoid them, it will save some considerable time.

The GRT

This is where the (Guttman Rosler Transform) comes in. The basic idea here is that instead of building an anonymous array in the pre-processing stage, we prepend the field data to the front of the string; sort the strings; then strip the bit we stuck on the front back off.

Going back to the simpler example of a field sort above, we can construct the GRT as follows:

print for map{ ## Chop off the bit we added. substr( $_, 3 ) } sort map{ ## Note: No comparison block callback. ## Extract the field as before, but concatenate it with the origin +al element ## instead of building an anonymous array containing both elements +. substr( $_, 1 ) . $_ } qw[ A473 B659 C123 D222 E001 ];; E001 C123 D222 A473 B659
A cheat

This is actually something of a cheat of a GRT. It only works because in my carefully constructed example data, the numeric fields all have the same number of digits, with smaller values having leading zeros. Whilst this works fine, and makes the algorithm easy to follow (try adding print statements inside the code blocks to follow what is going on), real life data is rarely so accommodating. Whilst you could use sprintf to add leading zeros and so make your ascii encoded numeric fields sort correctly using the string comparisons (cmp), it turns out that there is a better (faster) way.

That way is to use pack to convert the numeric fields into binary values that will sort correctly using a string comparison function. It is convenient that binary encode integers (NOTE:In 'network' format only. That is 'N'&'n' *NOT* 'V'&'v') will sort correctly using a string comparison function. As will floating point data in IEEE formats (the format used internally by Perl on most but not all systems).

So, using pack, the previous example becomes:

print for map{ unpack 'x[N] A*', $_ } sort map{ pack 'N A*', substr( $_, 1 ), $_ } qw[ A473 B659 C123 D222 E001 ];; E001 C123 D222 A473 B659

That may not look so very different from the previous version, but the big advantage is that it will handle any integer values correctly, whereas with the previous version, you need to know how big the largest numeric field is, so that you can add the correct number of leading zeroes.

It also trivially extends to handle the two fields example shown above for the ST:

print for map{ unpack 'x[NA1]A*', $_ } sort map{ pack 'NA1 A*', substr( $_, 1 ), substr( $_, 0, 1 ), $_ } qw[ A473 B437 B659 C659 C123 D123 D222 E222 E001 A001 ]; A001 E001 C123 D123 D222 E222 B437 A473 B659 C659

The pattern

Hopefully, the pattern is becoming obvious. Use pack with appropriate templates 'N', 'n', 'd' etc. to prepend the sort field data to the strings in the order or precedence. Then use unpack and the same template as you used for the sort fields, but wrapped in 'x[...]' to strip the prefix off and recover the original elements.

A worked example

This following is one possible solution to sort array by date:

#! Perl -slw use strict; my %months = ( FY => 0, Jan => 1, Feb => 2, Mar => 3, Apr => 4, May => 5, Jun => 6, Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12, ); print for map{ unpack 'x[nn]A*', $_ } sort map { my( $alpha, $num ) = m[^(\S+?)\s*(\d+)$]; $num += 2000 if $num <= 49; $num += 1900 if $num <= 99; pack 'nnA*', $num, $months{ $alpha }, $_; } <DATA>; __DATA__ Apr 2006 FY05 FY98 FY04 Dec 2007 Jan 1997 Jan 1998 Dec 1998

Output:

C:\test>junk Jan 1997 FY98 Jan 1998 Dec 1998 FY04 FY05 Apr 2006 Dec 2007

Further reading

  1. Schwartzian Transform
  2. Advanced Sorting - GRT - Guttman Rosler Transform
  3. Choosing the right sort (Or: WUDsamatter?)
  4. fast, flexible, stable sort
  5. Re: Sort on Number Embedded in String
  6. Re: Sort on Number Embedded in String
  7. Resorting to Sorting

Updated with many corrections as suggested by planetscape, jdporter, holli & blazar. Many thanks to all.

Replies are listed 'Best First'.
Re: RFC:A brief tutorial on Perl's native sorting facilities.
by japhy (Canon) on Feb 05, 2007 at 22:16 UTC
      Hi Jeff,

      Thanks for the link - another one added to my collection.

      Since both the OP and the link you posted are very similar, may I suggest combining the two into one resource to avoid duplication of effort/information.

      Regards,
      Mahesh

Re: RFC:A brief tutorial on Perl's native sorting facilities.
by hawtin (Prior) on Feb 06, 2007 at 10:20 UTC

    When I am doing things with sort I have a standard routine (which I call lexically) that attempts to implement a more natural sort order.

    Because of the scoping of $a and $b I have never made this into a module but normally just copy it into the script

    foreach my $foo (sort lexically @list) { ... } ... sub lexically { # This routine gives us a sort order that more closely matches # what a naive use would expect (ie "9-z" comes before "10-a") # The simple implementation style makes the code easier # to follow my($a_,$b_) = @_; # If we are called by sort the old @_ gets left around # attempt to detect this and grab values from $a and $b if(!defined($a_) || !defined($b_) || ref($a_) || ref($b_) || $#_ != 1) { $a_ = $a; $b_ = $b; } return 0 if($a_ eq "" && $b_ eq ""); return -1 if($a_ eq ""); return 1 if($b_ eq ""); my($a_1,$a_t,$a_2,$b_1,$b_t,$b_2); if($a_ =~ /^(\d+)/) { $a_t = 0; $a_1 = $1; $a_2 = $'; } elsif($a_ =~ /^(\D+)/) { $a_t = 1; $a_1 = $1; $a_2 = $'; } if($b_ =~ /^(\d+)/) { $b_t = 0; $b_1 = $1; $b_2 = $'; } elsif($b_ =~ /^(\D+)/) { $b_t = 1; $b_1 = $1; $b_2 = $'; } if($a_t == 0 && $b_t == 0) { # Both start with a numeric value return lexically($a_2,$b_2) if($a_1 == $b_1); return $a_1 <=> $b_1; } if($a_t == 1 && $b_t == 1) { # Both start with text, fold everything to # lower case (so "A" is the same as "a" # which is what I want, YMMV) my $r = lc($a_1) cmp lc($b_1); return lexically($a_2,$b_2) if($r == 0); return $r; } return -1 if($a_t == 0); return 1; }
      I have never made this into a module but normally just copy it into the script

      Your post was a great addition to the sorting tutorial. So it pains me to point out Sort::Naturally

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: RFC:A brief tutorial on Perl's native sorting facilities.
by holli (Abbot) on Feb 06, 2007 at 09:34 UTC
    By using the built-in cmp function
    cmp is not a function, it's an operator. And your link does not lead to the perldoc of cmp, but to the homenode of the user cmp.

    Besides that, good job.


    holli, /regexed monk/

      Thanks. Corrected above.

      Now the $64,000 questions:

      • Is there anything in the OP that is not covered elsewhere?
      • Or does it stand alone sufficiently for it to warrent it's addition to Tutorials?
      • If so, where?
      • And how is that done?

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Thanks. Corrected above.

        But now you have:

        "By using the built-in perlopcmp operator [...]"

        ITYM something like "By using the built-in cmp operator [...]"

Re: RFC:A brief tutorial on Perl's native sorting facilities.
by jimbojones (Friar) on Feb 06, 2007 at 14:48 UTC
    Hi

    Liked the tutorial. One additional thing that you might want to reference at this point
    How sort actually works internally doesn't really matter. That it is about as fast as you are likely to get; that it is built-in to language so always available; and that it understands Perl internals; does.
    is that the internals of the Perl sort changed from quicksort to a mergesort between 5.6 and 5.8.

    Although the main change is probably to defend against worst-case O(N**2) behaviour, the user can be bitten by this "feature"

    From CPAN Sort.pm
    A stable sort means that for records that compare equal, the original input ordering is preserved. Mergesort is stable, quicksort is not.
    This means if you use 5.6, and your comparator returns 0 for 2 items that are in fact different, the input order and the output order of those items may change. This behaviour burned me recently; others may want to know about it.

Re: RFC:A brief tutorial on Perl's native sorting facilities.
by Jenda (Abbot) on Feb 06, 2007 at 16:20 UTC

    I think another good node about sorting is Better sorting than the ST (was: Perl is dying), I think this "trick" should be included in the tutorial as well.

    { my @keys = map {...} @list; my @idx = sort {$keys[$a] cmp $keys[$b]} (0..$#list); @list = @list[@idx]; } # or { my @keys = map {...} @list; @list = @list[ sort {$keys[$a] cmp $keys[$b]} (0..$#list) ]; }
    As it doesn't create lots of tiny arrays it should even be quicker than ST. And it doesn't have the restrictions of GRT.

        GRT requires that you stringify (serialize) the values of the list you want to sort. That's not always possible. Eg. if it's a list of objects ...

        And even if it's just a list of references to some data structures, by serialization and deserialization you end up with copies of the data structures. Which may matter a lot if you keep other references (in)to the structures.

Re: RFC:A brief tutorial on Perl's native sorting facilities.
by shmem (Chancellor) on Feb 07, 2007 at 09:52 UTC
    Fine post. Some random nits...
    That it is about as fast as you are likely to get; that it is built-in to language so always available; and that it understands Perl internals; does.

    I fail to parse that sentence. I also fail to parse the updated version:

    That it is about as fast as you are likely to get, it is built-in to language so always available, and it understands Perl internals, do.

    What does the final ",do" do?

    So, sorting data when you want to sort according the the entire value of each element of the array or list is easy.

    I'd rather not use the statement-modifier mode here, instead I'd say (s/the the/the/):

    So, sorting data is easy when you want to sort according the entire value of each element of the array or list.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

      Okay shmem. That first sentence was always a mess, and correcting the grammer whilst leaving the structure intact didn't help much. Is this any better?

      How sort actually works internally doesn't really matter. What does matter is that being built-in to the language, it understands Perl's internals, is always available and about as fast as you are likely to achieve.

      s/the the/the/ should be s/the the/to the/. That I haven't noticed this before, having now read and re-read this dozens of times, does not surprise me. That many other pairs of eyes have looked right past this, does. Welcome to my world.

      Corrections made. Thanks.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        How sort actually works internally doesn't really matter. What does matter is that being built-in to the language, it understands Perl's internals, is always available and about as fast as you are likely to achieve.

        Nice. Now I've got it - the "..., does." was refering to the "doesn't matter" from the previous sentence. Anyways - it's way clearer now.

        That I haven't noticed this before, having now read and re-read this dozens of times, does not surprise me. That many other pairs of eyes have looked right past this, does. Welcome to my world.
        clap, clap, thank you - having stuttered as a child I just stumble over repeated words. And, good you showed me the "...,does." construct again ;-)

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://598394]
Approved by liverpole
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-29 08:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found