Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Natural sorting

by thundergnat (Deacon)
on Dec 14, 2007 at 22:23 UTC ( [id://657130]=perlmeditation: print w/replies, xml ) Need Help??

I was idly browsing Reddit a couple days ago and came across a link to this page discussing natural sorting. On it there are several code snippets implementing a natural sorting algorithm in various languages. The example for perl was frankly, pretty weak; it has been substantially improved on the past few days. but is still not so great IMO.

The Sort-Naturally module on CPAN is better in many respects, but has its own set of problems.

Neither deal with thousands separators very well (or at all). The alphanum routine has problems with arbitrarily large numbers. Sort::Naturally arbitrarily sorts digits to come after letters...except when it doesn't. And neither one is all that fast.

For a project I was working on a while ago, I needed sorting routine that would do the traditional natural sort (sort numbers by magnitude and words alphabetically) as well as a requirement to sort accented characters as their base character. {The word lists were from scanned and OCRed texts that contained a mix of English, French, Spanish and German words and we needed (among other things) to be able to sort the words to make sure that words with accented characters were being generated correctly.}

I took a took a fragment of code that was posted at this node by tye and modified it to suit.

With this routine:

  • Numbers are sorted by magnitude and words alphabetically.
  • Accented letters are treated like their base character for sorting purposes.
  • Thousands separators are ignored (mostly) while sorting.
  • Thousands separator and decimal point character are easily configurable to allow for European style numbers.
  • It is easily configurable to not do the specialized sorting.
  • It's blazin fast. :)


my $decimal = '.'; # decimal point indicator for "natural_sort" my $separator = ','; # thousands separator for "natural_sort" # deaccent will force sorting of Latin-1 word characters above \xC0 to + be # treated as their base or equivalent character. sub deaccent { my $phrase = shift; return $phrase unless ( $phrase =~ y/\xC0-\xFF// ); #short circuit + if no upper chars # translterate what we can (for speed) $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûü +Ýÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/; # and substitute the rest my %trans = qw(Æ AE æ ae Þ TH þ th Ð TH ð th ß ss); $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g; return $phrase; } # no-sep will allow the sorting algorithm to ignore (mostly) the prese +nce # of thousands separators in large numbers. It is configured by defaul +t # to be comma, but can be changed to whatever is desired. (a likely po +ssibility is .) sub no_sep { my $phrase = shift; $phrase =~ s/\Q$separator\E//g; return $phrase; } # Very fast natural sort routine. If (not) desired, delete the no-sep +and deaccent # modifiers to remove those effects. sub natural_sort { my $i; no warnings q/uninitialized/; s/((\Q$decimal\E0*)?)(\d+)/("\x0" x length $2) . (pack 'aNa*', 0, +length $3, $3)/eg, $_ .= ' ' . $i++ for ( my @x = map { lc deaccent n +o_sep $_} @_ ); @_[ map { (split)[-1] } sort @x ]; }

For small numbers of words, there isn't much difference in speed between the methods, but sort 100,000 words and the difference gets large.

Here's a benchmark script to compare them. It includes an assortment of numbers and words to compare the capabilities of each. It will write a small file of the sorted data, then copy the data array to itself several times to grow it to a reasonably large size, then benchmark the algorithms.

#/usr/bin/perl use warnings; use strict; use Benchmark qw(cmpthese); use Sort::Naturally; my $decimal = '.'; # decimal point indicator for "natural_sort" my $separator = ','; # thousands separator for "natural_sort" my @array = (<DATA>); # load in some test data open my $fh, '>', 'test.txt'; # open a file to write to for compari +son # write the data out to the file sorted by the different methods print $fh "natural sort\n", natural_sort(@array); print $fh '#' x 80, "\n"; print $fh "Natural::Sort\n", nsort(@array); print $fh '#' x 80, "\n"; print $fh "alphanum\n", sort { alphanum( $a, $b ) } @array; # scale up the data file to a reasonably large size to get a # better idea of how each algorithm scales for ( 0 .. 9 ) { push @array, @array; } # print the size of the data set print scalar @array, " items in array...\n"; my @temp; #temporary array to hold the results of the sorts. cmpthese( -60, { 'alphanum' => sub { @temp = sort { alphanum( $a, $b ) } + @array; }, 'Sort::Naturally' => sub { @temp = nsort(@array) }, 'natural_sort' => sub { @temp = natural_sort(@array) }, } ); ###################################################################### +######### # deaccent will force sorting of Latin-1 word characters above \xC0 to + be # treated as their base or equivalent character. sub deaccent { my $phrase = shift; return $phrase unless ( $phrase =~ y/\xC0-\xFF// ); #short circuit + if no upper chars # translterate what we can (for speed) $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûü +Ýÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/; # and substitute the rest my %trans = qw(Æ AE æ ae Þ TH þ th Ð TH ð th ß ss); $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g; return $phrase; } # no-sep will allow the sorting algorithm to ignore (mostly) the prese +nce # of thousands separators in large numbers. It is configured by defaul +t # to be comma, but can be changed to whatever is desired. (a likely po +ssibility is .) sub no_sep { my $phrase = shift; $phrase =~ s/\Q$separator\E//g; return $phrase; } # Very fast natural sort routine. If (not) desired, delete the no-sep +and deaccent # modifiers to remove those effects. sub natural_sort { my $i; no warnings q/uninitialized/; s/((\Q$decimal\E0*)?)(\d+)/("\x0" x length $2) . (pack 'aNa*', 0, +length $3, $3)/eg, $_ .= ' ' . $i++ for ( my @x = map { lc deaccent n +o_sep $_} @_ ); @_[ map { (split)[-1] } sort @x ]; } ###################################################################### +######### # from http://www.davekoelle.com/alphanum.html sub alphanum { # split strings into chunks my @a = chunkify( $_[0] ); my @b = chunkify( $_[1] ); # while we have chunks to compare. while ( @a && @b ) { my $a_chunk = shift @a; my $b_chunk = shift @b; my $test = ( ( $a_chunk =~ /\d/ ) && ( $b_chunk =~ /\d/ ) ) ? # if both are numeric $a_chunk <=> $b_chunk : # compare as numbers $a_chunk cmp $b_chunk; # else compare as strings # return comparison if not equal. return $test if $test != 0; } # return longer string. return @a <=> @b; } # split on numeric/non-numeric transitions sub chunkify { my @chunks = split m{ # split on (?= # zero width (?<=\D)\d | # digit preceded by a non-digit OR (?<=\d)\D # non-digit preceded by a digit ) }x, $_[0]; return @chunks; } ###################################################################### +######### __DATA__ 1 2.3 7 .7 .07 .009 .007 0.008 0.08 0.0008 1000 100.0 1,100.0 1,100.2 1100.95 1100.0 1100 08057 07011 90210 19105 2 20 20A 20X 20.1 200 2000 10000000000 1,000,000,001.00 1000000001.00 1000000001 1000000001.60 1234567890123456789012345678901234567890123456789012345678901234567890 +123456789012345678901234567890123456789012345678901234567890123456789 +012345678901234567890123456789012345678901234567890123456789012345678 +901234567890123456789012345678901234567890123456789012345678901234567 +890123456789012345678901234567890123456789012345678901234567890123456 +7890123456789012345678901234567890123456789012345678902 1234567890123456789012345678901234567890123456789012345678901234567890 +123456789012345678901234567890123456789012345678901234567890123456789 +012345678901234567890123456789012345678901234567890123456789012345678 +901234567890123456789012345678901234567890123456789012345678901234567 +890123456789012345678901234567890123456789012345678901234567890123456 +7890123456789012345678901234567890123456789012345678901 1st 2nd 3rd 33rd 144th apple Ball bald car Card Æon aether niño nina e-mail évian evoke foo fooa fool foo1 foo11 foo2 p4 p5 P6 p10 z1.doc z10.doc z100.doc z101.doc z102.doc z11.doc z12.doc z13.doc z2.doc z20.doc z3.doc z7.doc z8.doc z9.doc

Which yields on my system:

77824 items in array...
                s/iter        alphanum Sort::Naturally    natural_sort
alphanum          17.7              --            -69%            -93%
Sort::Naturally   5.43            225%              --            -76%
natural_sort      1.29           1273%            322%              --   

Now. The point of all this. What should I do with this? Is this worth converting to a module and releasing? If so, what should it be called? (Sort::Naturally is already taken.) Or am I barking up the wrong tree? (Or just barking?) Your comments and criticisms are welcome.

Replies are listed 'Best First'.
Re: Natural sorting
by dragonchild (Archbishop) on Dec 14, 2007 at 23:07 UTC
    Have you gone to the maintainer of Sort::Naturally and offered to provide a patch? Maybe it gets released under that name as part of the Sort::Naturally distro. Or, as is possible, you might even be given the Sort::Naturally distro to maintain. That's how I got my start with both PDF::Template and DBM::Deep.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Natural sorting
by pKai (Priest) on Dec 15, 2007 at 09:10 UTC

      Good point. Being an XS module I would expect it to have good performance, though for some reason it won't build on my system so I haven't been able to test it.

        I would appreciate if you could send my a bug report stating the cause of the build error (the output from make will do), your perl version, OS and processor (or perl -V output).

        oh, and BTW...

        use Sort::Key qw(keysort); sub mkkey { my $key = deaccent $_; $key =~ s/(?<=\d)\Q$separator\E(?>=\d)//g; $key =~ s{0*(\d+)}{ my $len = length $1; "\x00" . ('9' x ($len / 9)) . ($len % 9) . $1 }eg; $key; } ... cmpthese( -30, { #'alphanum' => sub { @temp = sort { alphanum( $a, $b ) +} @array; }, #'Sort::Naturally' => sub { @temp = nsort(@array) }, 'natural_sort' => sub { @temp = natural_sort(@array) }, 'Sort::Key' => sub { @temp = keysort \&mkkey, @array }, } );
        says...
        78848 items in array... s/iter natural_sort Sort::Key natural_sort 3.51 -- -31% Sort::Key 2.41 46% --
Re: Natural sorting
by Not_a_Number (Prior) on Dec 15, 2007 at 09:59 UTC

    Small aside: 'Œ'/'œ' is common in French, as in 'œuf' ('egg'), but is missing from your %trans hash.

      True. But the 'Œ'/'œ' ligature is not in ISO-Latin-1. If I'm going to expand into other codepages, I would be better off using something like Text::Unidecode to decompose the characters. It would be a substantial performance hit, but much easier to maintain.

      It comes down to what is "good enough", I guess.

Re: Natural sorting
by Anonymous Monk on Sep 26, 2014 at 20:31 UTC

    after seeing various complaints about Sort::Naturally then having questions/concerns about another option (Sort::Key::Natural) I "grew my own". it may not be "elegant" (it is kind of a brute force method) but it does seem to handle my cases OK. it may also be easier to customize because it is not using any "tricks".

    I have no problem with many CPAN packages but many of them seem to be confusing to use or seem to be rather large files for what should be simple functions and some do not seem to work well :(. this example may not be the fastest, most elegant, or "cleanest" code but it does the job I need it to do which for most cases boils down to doing things in the "Microsoft" file name order (for my benefit). for the other cases where I may use this I doubt anybody would notice something incorrect anyway.

    #figure this out later#my $separator_characters=" _-\\\\\\\."; # +these are the separator characters ready for use in a regex sub __natural_split_item($) { my @out_array; # # by brute force we will separate a string into the output array # by these cases as they are found: # # separator characters # numerics # alpha/non numeric # # we may need to add other stuff in the future... # # when we change to a different type we will push the previous dat +a into # the output array. # my $last_type=0; # set to no previous my $last_split=""; if(defined $_[0]) # we may get an undefined item.... { for(my $i=0;$i<length $_[0];$i++) { my $thischar=substr $_[0],$i,1; # will need to do +something for unicode if we need it.... my $thistype; # just do not use zero and do not duplica +te values. # # separator characters may need to be handled a little dif +ferently. # first, we need to change them into spaces and second we +need to # push them on the output array as single characters. thi +s may fix # the __ versus _ sort cases where __ shows up first but w +e do the _ # first. # # note that the number selected for $thistype does not mat +ter - we are # only looking for a change in separator type. # #figure this out later#if($thischar =~ /[$separator_charac +ters]/) # this character is a separator characte +r if($thischar =~ /[ _\\\.]/) # this ch +aracter is a separator character { $thistype=100; $thistype++ if($last_type == $thistype); + # repeated separators of any sort must be treated a bit different +ly $thischar=" "; } elsif($thischar =~ /^\d$/) { $thistype=200; } + # this character is a numeric character else { $thistype=300; } + # this character is a non numeric if($last_type!=0 # if we have a previous type (0 m +eans no previous) && $last_type!=$thistype # and we have changed type && $last_split ne "" # and we have something to pus +h (we should if we get to this test) ) { push @out_array,$last_split; $last_split=""; } $last_type=$thistype; $last_split .= $thischar; # add this character to the c +urrent } push @out_array,$last_split; # add the last component } return @out_array; } sub my_custom_natural_sort { my @a_list=__natural_split_item $a; my @b_list=__natural_split_item $b; for (my $i=0;defined $a_list[$i] && defined $b_list[$i];$i++) { next if $a_list[$i] eq $b_list[$i]; # move on to next ite +m if this split component is the same if(($a_list[$i]=~ /\d\d*/) && ($b_list[$i]=~ /\d\d*/)) # if +both numeric to a numeric test first { return $a_list[$i] <=> $b_list[$i] if $a_list[$i] != $b_li +st[$i]; # simple numeric compare if not same value return $a_list[$i] cmp $b_list[$i] if $a_list[$i] ne $b_li +st[$i]; # ascii compare so we can handle leading zero variationsif + not same value } return (lc $a_list[$i]) cmp (lc $b_list[$i]) if (lc $a_list[$i +]) ne (lc $b_list[$i]); # compare as same case first #return ($a_list[$i]) cmp ($b_list[$i]) if ($a_list[$i]) ne ($ +b_list[$i]); # then compare as case sensitive (maybe) } return (length @a_list) <=> (length @b_list); # just do list le +ngth - longer list should be later. if items the same then it does n +ot matter.... }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://657130]
Approved by GrandFather
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (6)
As of 2024-04-19 14:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found