Noodling with natural sorting in perl6

Update: fixed some minor spelling and formatting mistakes.

I've been noodling around with perl6 lately since the release of Rakudo star and wanted to share some some of my hard learned knowledge. (Note: Some of the knowledge was hard learned because documentation is pretty sparse still, a bunch was just because I can be pretty dense sometimes. YMMV.)

I looked around for something simple that is a pretty common procedure but is easy to implement incorrectly. I decided to try to do natural sorting.

When you sort strings that contain numbers, standard lexical sorting will return the items in "ASCIIbetical" order. Digits before all upper case characters before all lower case characters. Natural sorting yields numbers (strings of digits) sorted by order of magnitude then by magnitude before alphabetical position (upper or lower case). Note that this is not an "official canonical" definition, just how I am defining it, and is not, in fact, the order the perl5 Sort::Naturally module yields... but that's a rant for another day. Anyway... A group of lexically sorted strings might look like this:

0 1 100 11 144th 2 21 210 3rd 33rd AND ARE An Bit Can and by car d1 d10 d2

The same group sorted naturally:

0 1 2 3rd 11 21 33rd 100 144th 210 An AND and ARE Bit by Can car d1 d2 d10

I want to implement two routines: natural_sort, that will take an array of strings and return a sorted array, and natural_cmp which can be used as sort modifier similar to cmp. I will present them, then dissect them.

sub natural_sort {
    return @_ if @_ < 2;
    my @words = @_.map({ .lc });
    my $index = 0;
    for @words {
       .=subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, 
+:g);
       $_ ~= ' ' ~ $index++;
    };
    return @_[@words.sort.map({@(split(' ',$^_))[*-1]})];
}
[download]

sub natural_sort {

Just like Perl 5. There are some other subtleties but none used here.

return @_ if @_ < 2;

If there are less than two items in the list, there isn't much point in sorting it. :-)

my @words = @_.map({ .lc });

Make a copy of the passed in list of words, mapping an anonymous sub to each element in the list. In this case: lower casing it as you make the copy. The . is the method operator, not concatenation like perl5. When you don't specify a target it acts on the implicit variable $_, so '.lc' is equivalent to '$_.lc'.

my $index = 0;

Initialize an index variable to 0. In perl5 there is some magic attached to an undefined post incremented/decremented value evaluated in string context. It evaluates to '0'. In perl6 it evaluates to 'Any()'. Interesting, but not very useful as an array index.

for @words {

Iterate through the array, setting $_ to value of the element at each iteration. You don't need to enclose the array in parenthesis but there NEEDS to be a space between the array and opening brace.

.=subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g);

This line does a lot. '.=' means act on the implicit $_, perform the attached method then set the value equal to the result. subst() is another way of writing the s/// operator as a method. Inside the subst method I have the comma separated list of the match variable, the replacement variable, and the regex modifier. The match variable looks the same as perl5; a regex capturing a string of one or more consecutive digits. For the replacement variable I am executing an anonymous sub (enclosed in braces). The '-> $/' construct is a closure making the match value array available in the inner scope of the of the anonymous sub. By default the inner scope doesn't have access to the implicit variables in the outer scope. (As I understand it, this may change in the future.) The regex modifier is ':g', do it globally.

The anonymous sub replaces the matched digit string with a string made up of three pieces. First the stringified digit zero to make sure digit strings are sorted into the correct ASCII range - chr 48 through chr 58. I could use any digit as long as it falls in that range and they are all the same one. I chose 0. Second the length of the digit string ($0.chars, length doesn't exist anymore,) encoded as an ascii character. Third, the actual string of digits.

pack would probably be faster and less problem prone than sprintf but it hasn't been implemented yet in Rakudo star (as of R* 2010.07). The length of any digit sequence is limited to 255 characters since I'm using a character (8 bits) to encode it. Using pack would allow encoding as a network ordered long (32 bits)... and would probably be faster anyway.

$_ ~= ' ' ~ $index++;

Concatenate '~' a string to the end of the element consisting of a space then the index of the element. '~=' means concatenate then set the value equal to. It doesn't work implicitly so I need to specify the $_ variable.

return @_[@words.sort.map({@(split(' ',$^_))[*-1]})];

This is the meat of the routine. Return an array slice (@_[]) consisting of the passed in @_ array using the given indici list. The indici list consists of the @word array, sorted lexically with an anonymous sub mapped to each element. The sub splits the parameter $^_ on spaces, then returns the last element of the anonymous array formed by the values returned by split; which was the index value I added earlier.

Some commentary: The named parameter $^_ could have been called anything $^_ is what I choose to name it. I could have called it $^fred and it would work the same. The $^ twigil means it is a local positional parameter, only valid within the scope of that block.

The syntax to access array elements indexed from the end requires the asterisk. [-1] is now a syntax error.

Whew...

The natural_cmp routine will use pretty similar funtionality except it will only look at the first two values of the passed parameters, doesn't need to track the index since there is only two and returns -1, 0 or 1 depending on whether the values evaluate to less than, equal or greater than.

sub natural_cmp {
    my ($one, $two) = @_[0,1].map({ .=subst(/(\d+)/, 
      -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g).lc });
    return ($one cmp $two);
}
[download]

This works fine but really needs some kind of error checking. What if there is only one value in @_? Or three values? This is where positional parameters shine. I rewrote using the positional parameters $a and $b. $a and $b no longer have any global significance and you are free to use them as normal variable names, I just used them to echo how perl5 works. When you use positional parameters, the perl6 runtime will throw an exception if the wrong number of parameters is received. They are read-only though, so I need to make a copy of them to work with.

sub natural_cmp ($a, $b) {
    my ($first, $second) = ($a, $b);
    my ($one, $two) = ($first, $second).map({ .=subst(/(\d+)/, 
      -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g).lc });
    return ($one cmp $two);
}
[download]

Now to use these subroutines in a script. Pass a file to the script. It will open it, make a hash of the "words" in the file then print two lists. The first is the list of words sorted naturally. The second is the hash of words sorted by the number of times the word is seen with a secondary natural sort. (Each group of words that appears the same number of times is sorted naturally.) Don't pass in a big file. This works but it is slow.

use v6;
    # perl 6. automatic strict, warnings and autodie

my $filename = shift @*ARGS or die "You need to pass in a filename.";
    # pretty similar to perl5, @ARGV is now @*ARGS

my $handle = open $filename;
    # autodie with useful message if file can't be opened.
    # override with "no autodie" if desired.

my %words;

for $handle.lines -> $line { 
    # iterate through the file, stuff the line into $line on each iter
+ation
    %words{ $_ }++ for $line.comb(rx/ \w+ [<punct> \w+]* /);
        # .comb is sort of the opposite of split. rather than splittin
+g on what 
        # you don't want, comb out everything you do. 
        # the capturing regex looks for words consisting of one or mor
+e word characters
        # optionally followed by any amount of grouped punctuation mar
+ks and one or more word
        # characters. this might not be the best "word" definition but
+ I find it useful.
        # it will match (*most) email addresses, dotted quad IP addres
+s and contractions (can't, won't)
        # as a single entity 
}


.say for natural_sort(%words.keys);
    # print the list of words sorted naturally. works pretty well. 

.say for %words.sort: { $^b.value <=> $^a.value || natural_cmp($^a,$^b
+) };
    # print the list of words sorted by number of times seen with a se
+condary
    # natural sort. works but s l o w


sub natural_sort {
    return @_ if @_ < 2;
    my @words = @_.map({ .lc });
    my $index = 0;
    for @words {
       .=subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, 
+:g);
       $_ ~= ' ' ~ $index++;
    };
    return @_[@words.sort.map({@(split(' ',$^_))[*-1]})];
}


sub natural_cmp ($a, $b) {
    my ($first, $second) = ($a, $b);
    my ($one, $two) = ($first, $second).map({ .=subst(/(\d+)/, 
      -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g).lc });
    return ($one cmp $two);
}
[download]

Questions, comments, and corrections welcome.

Comment on Noodling with natural sorting in perl6 Select or Download Code

Replies are listed 'Best First'.
Re: Noodling with natural sorting in perl6 by moritz (Cardinal) on Aug 20, 2010 at 16:13 UTC
Thanks for your writeup! Below I'll make some suggestion of how you could take advantage of some nice Perl 6 features, and write more idiomatic Perl 6 code: `my $filename = shift @ARGS or die "You need to pass in a filename.";` [download] Instead I usually write `sub MAIN($filename) { # rest of the mainline code here }` [download] The MAIN sub does the command line parsing for you, and also gives an automatic usage message. `rx/ \w+ [<punct> \w+] /` This also be written as `rx/ [\w+] <punct> /`. Here `` is the general quantifier, `a 4` matches 4 a's, `a <punct>` matches an arbitrary number of a's, separated by punctuation. `sub natural_cmp ($a, $b) { my ($first, $second) = ($a, $b); my ($one, $two) = ($first, $second).map({ .=subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g).lc }); return ($one cmp $two); }` [download] If you leave out the `=` in `.=subst`, no in-place modification takes place; instead the modified value is returned, and the original variable remains untouched. So you can write `sub natural_cmp ($a, $b) { my ($one, $two) = ($a, $b).map(*.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g).lc); return $one cmp $two; }` [download] If you happen to need a copy, you can also say `sub mysub($a is copy, $b is copy)`, so you don't need to come up with more variable names. Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l] [select]
Re^2: Noodling with natural sorting in perl6 by thundergnat (Deacon) on Aug 20, 2010 at 18:28 UTC
`rx/ [\w+] ** <punct> /` Cool, thanks. I am still stuck in perl5 regex think to a large extent. `sub natural_cmp ($a, $b) { my ($one, $two) = ($a, $b).map(*.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g).lc); return $one cmp $two; }` [download] Ooooo shiney! Like I said, I can be dense sometimes. Even better, do away with all the intermediate variables. `sub natural_cmp ($a is copy, $b is copy) { return $a.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) + }, :g).lc cmp $b.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, + :g).lc; }` [download]	[reply] [d/l] [select]
Re^3: Noodling with natural sorting in perl6 by moritz (Cardinal) on Aug 20, 2010 at 20:26 UTC
That way you duplicated the transformation code. Here's a version that uses neither temporaries nor duplicated code: `sub natural_cmp ($a is copy, $b is copy) { return [cmp] ($a, ).map: *.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", + 0, $0.chars, $0) }, :g).lc }` [download] This uses the reduction meta operator. Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l]
Re: Noodling with natural sorting in perl6 by eric256 (Parson) on Aug 20, 2010 at 16:57 UTC
I played around and it seems that creating the copies is where a lot of time is lost in the second version causeing it to be slow. So i moved the processing out of the map and into a function, the cached results out of it, and used that for the sort. Makes it appear about the same speed as the first sort, not positive the best way to benchmark in perl6 yet though. `my %cache; sub pre_process($word) { unless %cache.exists($word) { %cache{$word} = $word.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", +0 , $0.chars, $0)}, :g).lc; } return %cache{$word}; } sub natural_cmp ($a, $b) { return (pre_process($a) cmp pre_process($b)); }` [download] ___________ Eric Hodges	[reply] [d/l]
Re^2: Noodling with natural sorting in perl6 by thundergnat (Deacon) on Aug 20, 2010 at 18:12 UTC
A very nice modification, one that should have been obvious to me in retrospect. You're basically implementing an orcish maneuver	[reply]
Re: Noodling with natural sorting in perl6 by thundergnat (Deacon) on Aug 25, 2010 at 18:17 UTC
And now, even more perl sixy... natural_sort can be rewritten to do no copying and use no intermediate variables. It's kind of obfu but a lot more memory efficient. The � is the hyper operator. It is sort of like map. It means apply that method to each element in the supplied list. `sub natural_sort { return @_ if @_ < 2; my $index = 0; return @_[@_�.lc�.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.c +hars, $0) }, :g).map( ~' '~$index++ ).sort.map({@(split(' ',$^_))[- +1]})]; }` [download] natural_cmp can also use the hyper operator and is now using leg instead of cmp. leg is the string specific comparison operator. (l)ess-(e)qual-(g)reater `sub natural_cmp ($a, $b) { return [leg] ($a, $b)�.lc�.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", + 0, $0.chars, $0) }, :g); }` [download] A cached version could look something like below. Note: the cached version actually runs slower than the uncached version right now because of a bug in Rakudo * 2010.07. It evaluates both sides of the //= even if the lhs is defined. :-/ That should be fixed in the near future though. :-) `my %cache; sub cached_natural_cmp ($a, $b) { return [leg] ($a, $b)�.lc.map( %cache{ * } //= .subst(/(\d+)/, -> + $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g)); }` [download] Update:* Actually, since perl6 has schwartzian transform semantics built into .sort, this could be very concisely rewritten as: `@array.sort(.lc.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars +, $0) }, :g));` [download] or, my favorite: `@array.sort: natural(); sub natural ($) { .lc.subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0. +chars, $0) }, :g) }` [download] used as `.say for %words.keys.sort: natural(); or %words.keys.sort( natural(*) )�.say;` [download] That's just beautiful.	[reply] [d/l] [select]


Perl Monk, Perl Meditation
	PerlMonks