Perl Monk, Perl Meditation | |
PerlMonks |
Noodling with natural sorting in perl6by thundergnat (Deacon) |
on Aug 20, 2010 at 15:28 UTC ( [id://856304]=perlmeditation: print w/replies, xml ) | Need Help?? |
Update: fixed some minor spelling and formatting mistakes. I've been noodling around with perl6 lately since the release of Rakudo star and wanted to share some some of my hard learned knowledge. (Note: Some of the knowledge was hard learned because documentation is pretty sparse still, a bunch was just because I can be pretty dense sometimes. YMMV.) I looked around for something simple that is a pretty common procedure but is easy to implement incorrectly. I decided to try to do natural sorting. When you sort strings that contain numbers, standard lexical sorting will return the items in "ASCIIbetical" order. Digits before all upper case characters before all lower case characters. Natural sorting yields numbers (strings of digits) sorted by order of magnitude then by magnitude before alphabetical position (upper or lower case). Note that this is not an "official canonical" definition, just how I am defining it, and is not, in fact, the order the perl5 Sort::Naturally module yields... but that's a rant for another day. Anyway... A group of lexically sorted strings might look like this: 0 1 100 11 144th 2 21 210 3rd 33rd AND ARE An Bit Can and by car d1 d10 d2 The same group sorted naturally: 0 1 2 3rd 11 21 33rd 100 144th 210 An AND and ARE Bit by Can car d1 d2 d10 I want to implement two routines: natural_sort, that will take an array of strings and return a sorted array, and natural_cmp which can be used as sort modifier similar to cmp. I will present them, then dissect them.
sub natural_sort {
Just like Perl 5. There are some other subtleties but none used here. return @_ if @_ < 2;If there are less than two items in the list, there isn't much point in sorting it. :-) my @words = @_.map({ .lc });Make a copy of the passed in list of words, mapping an anonymous sub to each element in the list. In this case: lower casing it as you make the copy. The . is the method operator, not concatenation like perl5. When you don't specify a target it acts on the implicit variable $_, so '.lc' is equivalent to '$_.lc'. my $index = 0;Initialize an index variable to 0. In perl5 there is some magic attached to an undefined post incremented/decremented value evaluated in string context. It evaluates to '0'. In perl6 it evaluates to 'Any()'. Interesting, but not very useful as an array index. for @words {Iterate through the array, setting $_ to value of the element at each iteration. You don't need to enclose the array in parenthesis but there NEEDS to be a space between the array and opening brace. .=subst(/(\d+)/, -> $/{ sprintf( "%s%c%s", 0, $0.chars, $0) }, :g);This line does a lot. '.=' means act on the implicit $_, perform the attached method then set the value equal to the result. subst() is another way of writing the s/// operator as a method. Inside the subst method I have the comma separated list of the match variable, the replacement variable, and the regex modifier. The match variable looks the same as perl5; a regex capturing a string of one or more consecutive digits. For the replacement variable I am executing an anonymous sub (enclosed in braces). The '-> $/' construct is a closure making the match value array available in the inner scope of the of the anonymous sub. By default the inner scope doesn't have access to the implicit variables in the outer scope. (As I understand it, this may change in the future.) The regex modifier is ':g', do it globally. The anonymous sub replaces the matched digit string with a string made up of three pieces. First the stringified digit zero to make sure digit strings are sorted into the correct ASCII range - chr 48 through chr 58. I could use any digit as long as it falls in that range and they are all the same one. I chose 0. Second the length of the digit string ($0.chars, length doesn't exist anymore,) encoded as an ascii character. Third, the actual string of digits. pack would probably be faster and less problem prone than sprintf but it hasn't been implemented yet in Rakudo star (as of R* 2010.07). The length of any digit sequence is limited to 255 characters since I'm using a character (8 bits) to encode it. Using pack would allow encoding as a network ordered long (32 bits)... and would probably be faster anyway. $_ ~= ' ' ~ $index++;Concatenate '~' a string to the end of the element consisting of a space then the index of the element. '~=' means concatenate then set the value equal to. It doesn't work implicitly so I need to specify the $_ variable. return @_[@words.sort.map({@(split(' ',$^_))[*-1]})];This is the meat of the routine. Return an array slice (@_[]) consisting of the passed in @_ array using the given indici list. The indici list consists of the @word array, sorted lexically with an anonymous sub mapped to each element. The sub splits the parameter $^_ on spaces, then returns the last element of the anonymous array formed by the values returned by split; which was the index value I added earlier. Some commentary: The named parameter $^_ could have been called anything $^_ is what I choose to name it. I could have called it $^fred and it would work the same. The $^ twigil means it is a local positional parameter, only valid within the scope of that block. The syntax to access array elements indexed from the end requires the asterisk. [-1] is now a syntax error. Whew... The natural_cmp routine will use pretty similar funtionality except it will only look at the first two values of the passed parameters, doesn't need to track the index since there is only two and returns -1, 0 or 1 depending on whether the values evaluate to less than, equal or greater than.
This works fine but really needs some kind of error checking. What if there is only one value in @_? Or three values? This is where positional parameters shine. I rewrote using the positional parameters $a and $b. $a and $b no longer have any global significance and you are free to use them as normal variable names, I just used them to echo how perl5 works. When you use positional parameters, the perl6 runtime will throw an exception if the wrong number of parameters is received. They are read-only though, so I need to make a copy of them to work with.
Now to use these subroutines in a script. Pass a file to the script. It will open it, make a hash of the "words" in the file then print two lists. The first is the list of words sorted naturally. The second is the hash of words sorted by the number of times the word is seen with a secondary natural sort. (Each group of words that appears the same number of times is sorted naturally.) Don't pass in a big file. This works but it is slow.
Questions, comments, and corrections welcome.
Back to
Meditations
|
|