Natural sorting

I was idly browsing Reddit a couple days ago and came across a link to this page discussing natural sorting. On it there are several code snippets implementing a natural sorting algorithm in various languages. The example for perl was frankly, pretty weak; it has been substantially improved on the past few days. but is still not so great IMO.

The Sort-Naturally module on CPAN is better in many respects, but has its own set of problems.

Neither deal with thousands separators very well (or at all). The alphanum routine has problems with arbitrarily large numbers. Sort::Naturally arbitrarily sorts digits to come after letters...except when it doesn't. And neither one is all that fast.

For a project I was working on a while ago, I needed sorting routine that would do the traditional natural sort (sort numbers by magnitude and words alphabetically) as well as a requirement to sort accented characters as their base character. {The word lists were from scanned and OCRed texts that contained a mix of English, French, Spanish and German words and we needed (among other things) to be able to sort the words to make sure that words with accented characters were being generated correctly.}

I took a took a fragment of code that was posted at this node by tye and modified it to suit.

With this routine:

Numbers are sorted by magnitude and words alphabetically.
Accented letters are treated like their base character for sorting purposes.
Thousands separators are ignored (mostly) while sorting.
Thousands separator and decimal point character are easily configurable to allow for European style numbers.
It is easily configurable to not do the specialized sorting.
It's blazin fast. :)

my $decimal   = '.';    # decimal point indicator for "natural_sort"
my $separator = ',';    # thousands separator for "natural_sort"

# deaccent will force sorting of Latin-1 word characters above \xC0 to
+ be
# treated as their base or equivalent character.
sub deaccent {
    my $phrase = shift;
    return $phrase unless ( $phrase =~ y/\xC0-\xFF// ); #short circuit
+ if no upper chars
    # translterate what we can (for speed)
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûü
+Ýÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # and substitute the rest
    my %trans = qw(Æ AE æ ae Þ TH þ th Ð TH ð th ß ss);
    $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
    return $phrase;
}

# no-sep will allow the sorting algorithm to ignore (mostly) the prese
+nce
# of thousands separators in large numbers. It is configured by defaul
+t
# to be comma, but can be changed to whatever is desired. (a likely po
+ssibility is .)
sub no_sep {
    my $phrase = shift;
    $phrase =~ s/\Q$separator\E//g;
    return $phrase;
}

# Very fast natural sort routine. If (not) desired, delete the no-sep 
+and deaccent
# modifiers to remove those effects.
sub natural_sort {
    my $i;
    no warnings q/uninitialized/;
    s/((\Q$decimal\E0*)?)(\d+)/("\x0" x length $2) . (pack 'aNa*', 0, 
+length $3, $3)/eg, $_ .= ' ' . $i++ for ( my @x = map { lc deaccent n
+o_sep $_} @_ );
    @_[ map { (split)[-1] } sort @x ];
}
[download]

For small numbers of words, there isn't much difference in speed between the methods, but sort 100,000 words and the difference gets large.

Here's a benchmark script to compare them. It includes an assortment of numbers and words to compare the capabilities of each. It will write a small file of the sorted data, then copy the data array to itself several times to grow it to a reasonably large size, then benchmark the algorithms.

#/usr/bin/perl
use warnings;
use strict;
use Benchmark qw(cmpthese);
use Sort::Naturally;

my $decimal   = '.';    # decimal point indicator for "natural_sort"
my $separator = ',';    # thousands separator for "natural_sort"

my @array = (<DATA>);   # load in some test data

open my $fh, '>', 'test.txt';    # open a file to write to for compari
+son

# write the data out to the file sorted by the different methods
print $fh "natural sort\n", natural_sort(@array);
print $fh '#' x 80, "\n";
print $fh "Natural::Sort\n", nsort(@array);
print $fh '#' x 80, "\n";
print $fh "alphanum\n", sort { alphanum( $a, $b ) } @array;

# scale up the data file to a reasonably large size to get a
# better idea of how each algorithm scales
for ( 0 .. 9 ) {
    push @array, @array;
}

# print the size of the data set
print scalar @array, " items in array...\n";

my @temp; #temporary array to hold the results of the sorts.

cmpthese(
    -60,
    {
        'alphanum'        => sub { @temp = sort { alphanum( $a, $b ) }
+ @array; },
        'Sort::Naturally' => sub { @temp = nsort(@array) },
        'natural_sort'    => sub { @temp = natural_sort(@array) },
    }
);

######################################################################
+#########

# deaccent will force sorting of Latin-1 word characters above \xC0 to
+ be
# treated as their base or equivalent character.
sub deaccent {
    my $phrase = shift;
    return $phrase unless ( $phrase =~ y/\xC0-\xFF// ); #short circuit
+ if no upper chars
    # translterate what we can (for speed)
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûü
+Ýÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # and substitute the rest
    my %trans = qw(Æ AE æ ae Þ TH þ th Ð TH ð th ß ss);
    $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
    return $phrase;
}

# no-sep will allow the sorting algorithm to ignore (mostly) the prese
+nce
# of thousands separators in large numbers. It is configured by defaul
+t
# to be comma, but can be changed to whatever is desired. (a likely po
+ssibility is .)
sub no_sep {
    my $phrase = shift;
    $phrase =~ s/\Q$separator\E//g;
    return $phrase;
}

# Very fast natural sort routine. If (not) desired, delete the no-sep 
+and deaccent
# modifiers to remove those effects.
sub natural_sort {
    my $i;
    no warnings q/uninitialized/;
    s/((\Q$decimal\E0*)?)(\d+)/("\x0" x length $2) . (pack 'aNa*', 0, 
+length $3, $3)/eg, $_ .= ' ' . $i++ for ( my @x = map { lc deaccent n
+o_sep $_} @_ );
    @_[ map { (split)[-1] } sort @x ];
}

######################################################################
+#########

# from http://www.davekoelle.com/alphanum.html

sub alphanum {

    # split strings into chunks
    my @a = chunkify( $_[0] );
    my @b = chunkify( $_[1] );

    # while we have chunks to compare.
    while ( @a && @b ) {
        my $a_chunk = shift @a;
        my $b_chunk = shift @b;

        my $test = ( ( $a_chunk =~ /\d/ ) && ( $b_chunk =~ /\d/ ) )
          ?    # if both are numeric
          $a_chunk <=> $b_chunk
          :                         # compare as numbers
          $a_chunk cmp $b_chunk;    # else compare as strings

        # return comparison if not equal.
        return $test if $test != 0;
    }

    # return longer string.
    return @a <=> @b;
}

# split on numeric/non-numeric transitions
sub chunkify {
    my @chunks = split m{ # split on
    (?= # zero width
      (?<=\D)\d | # digit preceded by a non-digit OR
      (?<=\d)\D # non-digit preceded by a digit
    )
  }x, $_[0];
    return @chunks;
}

######################################################################
+#########

__DATA__
1
2.3
7
.7
.07
.009
.007
0.008
0.08
0.0008
1000
100.0
1,100.0
1,100.2
1100.95
1100.0
1100
08057
07011
90210
19105
2
20
20A
20X
20.1
200
2000
10000000000
1,000,000,001.00
1000000001.00
1000000001
1000000001.60
1234567890123456789012345678901234567890123456789012345678901234567890
+123456789012345678901234567890123456789012345678901234567890123456789
+012345678901234567890123456789012345678901234567890123456789012345678
+901234567890123456789012345678901234567890123456789012345678901234567
+890123456789012345678901234567890123456789012345678901234567890123456
+7890123456789012345678901234567890123456789012345678902
1234567890123456789012345678901234567890123456789012345678901234567890
+123456789012345678901234567890123456789012345678901234567890123456789
+012345678901234567890123456789012345678901234567890123456789012345678
+901234567890123456789012345678901234567890123456789012345678901234567
+890123456789012345678901234567890123456789012345678901234567890123456
+7890123456789012345678901234567890123456789012345678901
1st
2nd
3rd
33rd
144th
apple
Ball
bald
car
Card
Æon
aether
niño
nina
e-mail
évian
evoke
foo
fooa
fool
foo1
foo11
foo2
p4
p5
P6
p10
z1.doc
z10.doc
z100.doc
z101.doc
z102.doc
z11.doc
z12.doc
z13.doc
z2.doc
z20.doc
z3.doc
z7.doc
z8.doc
z9.doc
[download]

Which yields on my system:

77824 items in array...
                s/iter        alphanum Sort::Naturally    natural_sort
alphanum          17.7              --            -69%            -93%
Sort::Naturally   5.43            225%              --            -76%
natural_sort      1.29           1273%            322%              --

Now. The point of all this. What should I do with this? Is this worth converting to a module and releasing? If so, what should it be called? (Sort::Naturally is already taken.) Or am I barking up the wrong tree? (Or just barking?) Your comments and criticisms are welcome.

Comment on Natural sorting Select or Download Code

Replies are listed 'Best First'.
Re: Natural sorting by dragonchild (Archbishop) on Dec 14, 2007 at 23:07 UTC
Have you gone to the maintainer of Sort::Naturally and offered to provide a patch? Maybe it gets released under that name as part of the Sort::Naturally distro. Or, as is possible, you might even be given the Sort::Naturally distro to maintain. That's how I got my start with both PDF::Template and DBM::Deep. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re: Natural sorting by pKai (Priest) on Dec 15, 2007 at 09:10 UTC
Did you concider using one of salva's Sort::Key::* modules, maybe Sort::Key::Natural? I'd expect that to show superior performance.	[reply]
Re^2: Natural sorting by thundergnat (Deacon) on Dec 15, 2007 at 16:45 UTC
Good point. Being an XS module I would expect it to have good performance, though for some reason it won't build on my system so I haven't been able to test it.	[reply]
Re^3: Natural sorting by salva (Canon) on Dec 17, 2007 at 19:33 UTC
I would appreciate if you could send my a bug report stating the cause of the build error (the output from `make` will do), your perl version, OS and processor (or `perl -V` output). oh, and BTW... `use Sort::Key qw(keysort); sub mkkey { my $key = deaccent $_; $key =~ s/(?<=\d)\Q$separator\E(?>=\d)//g; $key =~ s{0*(\d+)}{ my $len = length $1; "\x00" . ('9' x ($len / 9)) . ($len % 9) . $1 }eg; $key; } ... cmpthese( -30, { #'alphanum' => sub { @temp = sort { alphanum( $a, $b ) +} @array; }, #'Sort::Naturally' => sub { @temp = nsort(@array) }, 'natural_sort' => sub { @temp = natural_sort(@array) }, 'Sort::Key' => sub { @temp = keysort \&mkkey, @array }, } );` [download] says... `78848 items in array... s/iter natural_sort Sort::Key natural_sort 3.51 -- -31% Sort::Key 2.41 46% --` [download]	[reply] [d/l] [select]
Re: Natural sorting by Not_a_Number (Prior) on Dec 15, 2007 at 09:59 UTC
Small aside: 'Œ'/'œ' is common in French, as in 'œuf' ('egg'), but is missing from your `%trans` hash.	[reply] [d/l]
Re^2: Natural sorting by thundergnat (Deacon) on Dec 15, 2007 at 16:56 UTC
True. But the 'Œ'/'œ' ligature is not in ISO-Latin-1. If I'm going to expand into other codepages, I would be better off using something like Text::Unidecode to decompose the characters. It would be a substantial performance hit, but much easier to maintain. It comes down to what is "good enough", I guess.	[reply]
Re: Natural sorting by Anonymous Monk on Sep 26, 2014 at 20:31 UTC
after seeing various complaints about Sort::Naturally then having questions/concerns about another option (Sort::Key::Natural) I "grew my own". it may not be "elegant" (it is kind of a brute force method) but it does seem to handle my cases OK. it may also be easier to customize because it is not using any "tricks". I have no problem with many CPAN packages but many of them seem to be confusing to use or seem to be rather large files for what should be simple functions and some do not seem to work well :(. this example may not be the fastest, most elegant, or "cleanest" code but it does the job I need it to do which for most cases boils down to doing things in the "Microsoft" file name order (for my benefit). for the other cases where I may use this I doubt anybody would notice something incorrect anyway. #figure this out later#my $separator_characters=" _-\\\\\\\."; # +these are the separator characters ready for use in a regex sub __natural_split_item($) { my @out_array; # # by brute force we will separate a string into the output array # by these cases as they are found: # # separator characters # numerics # alpha/non numeric # # we may need to add other stuff in the future... # # when we change to a different type we will push the previous dat +a into # the output array. # my $last_type=0; # set to no previous my $last_split=""; if(defined $_[0]) # we may get an undefined item.... { for(my $i=0;$i<length $_[0];$i++) { my $thischar=substr $_[0],$i,1; # will need to do +something for unicode if we need it.... my $thistype; # just do not use zero and do not duplica +te values. # # separator characters may need to be handled a little dif +ferently. # first, we need to change them into spaces and second we +need to # push them on the output array as single characters. thi +s may fix # the __ versus _ sort cases where __ shows up first but w +e do the _ # first. # # note that the number selected for $thistype does not mat +ter - we are # only looking for a change in separator type. # #figure this out later#if($thischar =~ /[$separator_charac +ters]/) # this character is a separator characte +r if($thischar =~ /[ _\\\.]/) # this ch +aracter is a separator character { $thistype=100; $thistype++ if($last_type == $thistype); + # repeated separators of any sort must be treated a bit different +ly $thischar=" "; } elsif($thischar =~ /^\d$/) { $thistype=200; } + # this character is a numeric character else { $thistype=300; } + # this character is a non numeric if($last_type!=0 # if we have a previous type (0 m +eans no previous) && $last_type!=$thistype # and we have changed type && $last_split ne "" # and we have something to pus +h (we should if we get to this test) ) { push @out_array,$last_split; $last_split=""; } $last_type=$thistype; $last_split .= $thischar; # add this character to the c +urrent } push @out_array,$last_split; # add the last component } return @out_array; } sub my_custom_natural_sort { my @a_list=__natural_split_item $a; my @b_list=__natural_split_item $b; for (my $i=0;defined $a_list[$i] && defined $b_list[$i];$i++) { next if $a_list[$i] eq $b_list[$i]; # move on to next ite +m if this split component is the same if(($a_list[$i]=~ /\d\d/) && ($b_list[$i]=~ /\d\d/)) # if +both numeric to a numeric test first { return $a_list[$i] <=> $b_list[$i] if $a_list[$i] != $b_li +st[$i]; # simple numeric compare if not same value return $a_list[$i] cmp $b_list[$i] if $a_list[$i] ne $b_li +st[$i]; # ascii compare so we can handle leading zero variationsif + not same value } return (lc $a_list[$i]) cmp (lc $b_list[$i]) if (lc $a_list[$i +]) ne (lc $b_list[$i]); # compare as same case first #return ($a_list[$i]) cmp ($b_list[$i]) if ($a_list[$i]) ne ($ +b_list[$i]); # then compare as case sensitive (maybe) } return (length @a_list) <=> (length @b_list); # just do list le +ngth - longer list should be later. if items the same then it does n +ot matter.... } [download]	[reply] [d/l]


The stupid question is the question not asked
	PerlMonks