http://qs321.pair.com?node_id=1202470

cr8josh has asked for the wisdom of the Perl Monks concerning the following question:

Sort::Key::Natural seems to give different results depending on capitalization. Strings with only the first letter uppercase sort differently from both all-uppercase and all-lowercase strings.

use feature qw( say );
use Sort::Key::Natural qw( natsort );

say for natsort qw( P007b_Yum P007_Yum );
say qw(---);
say for natsort qw( P007B_YUM P007_YUM );
say qw(---);
say for natsort qw( P007b_yum P007_yum );

Output:

P007_Yum
P007b_Yum
---
P007B_YUM
P007_YUM
---
P007b_yum
P007_yum

In the first case (uppercase first), "P007b_" sorts after "P007_". In the other cases (all upper or all lower), it sorts before. As a result, I can't do case-insensitive sorting without changing the results. Can any monks help?

Thanks!

Replies are listed 'Best First'.
Re: Sort::Key::Natural sorting discrepancy
by tangent (Parson) on Nov 01, 2017 at 01:57 UTC
    I am not familiar with the module but in the docs it says:
    Spaces, symbols and non-printable characters are only considered for splitting the string into its parts but not for sorting. For instance foo-bar-42 is broken in three substrings foo, bar and 42 and after that the dashes are ignored.
    For the examples you give the sort order would then be:

    P007BYUM
    P007YUM

    P007Yum
    P007bYum

    P007byum
    P007yum

    Which to my mind is correct. You might be able to use some variation of the Schwartzian Transform to achieve your desired output.
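A rough way to check this reading, assuming the module behaves as the quoted documentation describes: drop the symbols and compare what is left with the built-in `sort`. This is only a sketch in core Perl, not the module's actual key generation, and `strip_symbols` is a hypothetical helper introduced for illustration.

```perl
use strict;
use warnings;
use feature qw( say );

# Hypothetical helper: remove everything that is not a letter or digit,
# imitating "symbols are not considered for sorting" from the docs.
sub strip_symbols {
    my ($s) = @_;
    (my $key = $s) =~ s/[^A-Za-z0-9]+//g;
    return $key;
}

my @groups = (
    [qw( P007b_Yum P007_Yum )],
    [qw( P007B_YUM P007_YUM )],
    [qw( P007b_yum P007_yum )],
);

my @results;
for my $group (@groups) {
    # Sort the original strings by their stripped, case-sensitive keys.
    my @sorted = sort { strip_symbols($a) cmp strip_symbols($b) } @$group;
    push @results, \@sorted;
    say for @sorted;
    say '---';
}
```

In each pair the character after 007 ends up being compared with the first letter of Yum, YUM or yum. In ASCII, "b" sorts after "Y" but before "y", and "B" sorts before "Y", so only the mixed-case pair puts the "b" variant last.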
Re: Sort::Key::Natural sorting discrepancy
by 1nickt (Canon) on Oct 31, 2017 at 20:31 UTC

    Hi, you don't show your desired output, but as the doc for Sort::Key::Natural says:

    Note, that the sorting is case sensitive. To do a case insensitive sort you have to convert the keys explicitly

    Use natkeysort() instead:

    use feature qw( say );
    use Sort::Key::Natural qw( natkeysort );

    my @list = qw(P007b_Yum P007_Yum P007B_YUM P007_YUM P007b_yum P007_yum);
    say for natkeysort { lc $_ } @list;

    $ perl 1202470.pl
    P007b_Yum
    P007B_YUM
    P007b_yum
    P007_Yum
    P007_YUM
    P007_yum

    Hope this helps!


    The way forward always starts with a minimal test.

      Thanks! I was using that exact thing but my problem is that I actually want P007 first, then P007b, which is what I would expect from natural sorting (this is how Windows sorts filenames). Once I use natkeysort and force it to lowercase, it sorts backwards from what I expect. I don't understand why they would be different. This also gets the same switch of behavior:

      use feature qw( say );
      use Sort::Key::Natural qw( natkeysort );

      say for natkeysort {$_} qw( P007B_YUM P007_YUM );
      say qw(---);
      say for natkeysort {$_} qw( P007b_Yum P007_Yum );
      say qw(---);
      say for natkeysort {$_} qw( P007b_yum P007_yum );

      That gets the same switch of order.

      P007B_YUM
      P007_YUM
      ---
      P007_Yum
      P007b_Yum
      ---
      P007b_yum
      P007_yum

        Hm yeap, I see what you mean.

        P007b_Yum
        P007B_YUM
        P007b_yum
        P007_Yum
        P007_YUM
        P007_yum

        I would have expected:

        P007_yum
        P007_Yum
        P007_YUM
        P007b_yum
        P007b_Yum
        P007B_YUM

        Well, hopefully salva will come along soon and explain the mystery.


        The way forward always starts with a minimal test.
Re: Sort::Key::Natural sorting discrepancy
by kcott (Archbishop) on Nov 01, 2017 at 07:38 UTC

    G'day cr8josh,

    It's unclear what you want because you haven't shown that.

    Piecing together what's in your OP and "Re^2: Sort::Key::Natural sorting discrepancy", it seems you want a case-insensitive sort with underscores sorting before letters. In ASCII, uppercase letters (65-90) come before an underscore (95) which comes before lowercase letters (97-122). If that is indeed what you want, you can do it like this:

    $ perl -E '
        my @x = (
            [qw{P007b_Yum P007_Yum}],
            [qw{P007B_YUM P007_YUM}],
            [qw{P007b_yum P007_yum}]
        );
        for (@x) {
            say $_->[0] for
                sort { $a->[1] cmp $b->[1] }
                map { [ $_, lc $_ ] } @$_;
            say "-" x 3;
        }
    '
    P007_Yum
    P007b_Yum
    ---
    P007_YUM
    P007B_YUM
    ---
    P007_yum
    P007b_yum
    ---

    — Ken

Re: Sort::Key::Natural sorting discrepancy
by Laurent_R (Canon) on Nov 01, 2017 at 10:00 UTC
    Hi cr8josh,

    If I understand correctly what you want, this might be your solution:

    use strict;
    use warnings;
    use feature 'say';

    my @list = qw(P007b_Yum P007_Yum P007B_YUM P007_YUM P007b_yum P007_yum);
    my @sorted = sort { lc $a cmp lc $b } sort { $b cmp $a } @list;
    say for @sorted;

    This produces the following output:

    $ perl sort_lc.pl
    P007_yum
    P007_Yum
    P007_YUM
    P007b_yum
    P007b_Yum
    P007B_YUM

    The first sort (on the right-hand side) orders the strings that have the same letters in different cases; the second sort (on the left) then deals with underscores versus letters. This works because Perl's default sort algorithm (merge sort) is stable, so items that compare equal once lower-cased (second sort) keep the order established by the first sort. Note that the documentation of the sort pragma describes this stability as a side effect of the default algorithm; use sort 'stable'; requests it explicitly.

    If this is still not the order you want, please let us know what you expect.
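The same order can also be obtained from a single comparator that does not rely on the stability of `sort`: compare the lower-cased strings first and fall back to a reversed case-sensitive comparison only on ties. A minimal sketch, using the same list:

```perl
use strict;
use warnings;
use feature qw( say );

my @list = qw( P007b_Yum P007_Yum P007B_YUM P007_YUM P007b_yum P007_yum );

# Primary key: case-insensitive order.
# Tie-breaker: reverse case-sensitive order, so that among strings with
# the same letters the lowercase variants come before the uppercase ones.
my @sorted = sort { lc $a cmp lc $b or $b cmp $a } @list;

say for @sorted;
```

Because every tie is resolved inside the comparator, the result is the same whichever sort algorithm perl happens to use.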

Re: Sort::Key::Natural sorting discrepancy
by salva (Canon) on Nov 01, 2017 at 13:20 UTC
    As tangent has already explained in his post, Sort::Key::Natural breaks the string at the boundaries between alphabetic and number groups. So for instance, P007b_Yum is equivalent to P_7_b_Yum.

    If you want alphanumeric groups to remain unbroken, you will have to change the sorting-key generation algorithm.

    This probably does what you want (untested):

    use Sort::Key qw(keysort);

    sub mknkey {
        my $n = shift;
        $n =~ s/^0+//;
        my $len = length $n;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        ('9' x $nines) . $rest . $n;
    }

    sub mknatkey {
        my $nat = @_ ? shift : $_;
        $nat =~ s/(\d+)/mknkey($1)/ge;
        $nat;
    }

    my @sorted = keysort { mknatkey($_) } @data;
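For anyone who wants to try the key-generation idea above without installing Sort::Key, here is a self-contained sketch. It makes two assumptions that are not in the code above: `keysort` is swapped for the core `sort` wrapped in a Schwartzian Transform, and the key is lower-cased with `lc` on the guess that a case-insensitive order is wanted. The contents of `@data` are made up for illustration.

```perl
use strict;
use warnings;
use feature qw( say );

# Prefix each number with its digit count (leading zeros removed) so that
# plain string comparison orders numbers by value: 7 -> "17", 10 -> "210".
sub mknkey {
    my $n = shift;
    $n =~ s/^0+//;
    my $len   = length $n;
    my $nines = int( $len / 9 );
    my $rest  = $len - 9 * $nines;
    return ( '9' x $nines ) . $rest . $n;
}

# Rewrite every run of digits in the string; everything else, including
# underscores, stays in the key unchanged.
sub mknatkey {
    my $nat = @_ ? shift : $_;
    $nat =~ s/(\d+)/mknkey($1)/ge;
    return $nat;
}

# Hypothetical sample data.
my @data = qw( P007B_YUM P10_yum P007_YUM P9_Yum P007a_Yum );

# Schwartzian Transform: compute each key once, sort on it, unwrap.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ mknatkey( lc $_ ), $_ ] } @data;

say for @sorted;
```

With the `lc` call, the underscore is compared against lowercase letters only, so P007_YUM comes before P007a_Yum and P007B_YUM, and P9_Yum still comes before P10_yum. Without it, an uppercase "B" would sort ahead of the underscore.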

      Hi salva,

      Thanks for clarifying. Now that I know from your follow-up what the truth is, I can discern it in the docs. However, I humbly submit that the doc could be clearer.

      I misread the existing doc and concluded (encouraged in the wrong conclusion by the OP's code) that the string was split only on the underscores etc. (i.e.: 'boundaries' == 'underscores, etc.'). That's mostly my fault, for only seeing what I thought I was looking for, but here's a suggestion: place next to each other the descriptions of the two ways in which the string could be split, and highlight that alpha words and numbers are different, not a group:

      "Under natural sorting, strings are split at word and number boundaries, and the resulting ... "

      Possibly better as:

      "Under natural sorting, strings are split at boundaries between words and numbers and at non-alphanumeric characters (see below), and the resulting ..."

      Let me know if a patch / PR would be helpful, if you think this would be an improvement.

      Thanks for all you do for Perl!


      The way forward always starts with a minimal test.
        Following your advice, I have changed the description on the development version of the module.

      Many, many thanks to all who chimed in. Apologies that my goal was not clear! My high-level goal is to reproduce Windows Explorer's file sorting. The only solution that worked for me consistently was calling Win32::API, which I preferred not to do...

      Salva, you nailed it, I needed alphanumeric groups to sort unbroken, and your solution works perfectly. Thank you!