http://qs321.pair.com?node_id=1192665


in reply to Re^2: Sort undef
in thread Sort undef

I'm hoping deaccent will change chr(255) (y with two dots) to a plain y. It also fails if there are strings starting with Unicode characters above 255. Handling Unicode in full generality is a huge pain, so I punted. You probably need something like Unicode::Collate to do it right.
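As a rough illustration of the Unicode::Collate suggestion, here is a minimal sketch comparing plain code point order with collated order. The sample words are made up for the example, and the collator uses default settings, which may not match any particular locale's rules.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample data: "\x{ff}" is U+00FF, y with diaeresis.
my @words = ( 'z', "\x{ff}", 'y' );

# Plain sort compares code points, so U+00FF lands after 'z'.
my @by_codepoint = sort @words;

# The default collator follows the Unicode Collation Algorithm,
# which places U+00FF next to 'y', before 'z'.
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point order: @by_codepoint\n";
print "collated order:   @collated\n";
```

Note that the collator only handles defined strings; undef values would still need separate treatment before or during the sort.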

Re^4: Sort undef
by Tux (Canon) on Jun 14, 2017 at 09:26 UTC

    Use Unicode. Perl is quite good at that.

    $ perl -MDP -we'my@x=("",undef,"1","123","2","\xff","\x{00ff}");DPeek for map{$_->[1]}sort{$a->[0]cmp$b->[0]}map{[$_//"\x{1ffff}",$_]}@x'
    PV(""\0)
    PV("1"\0)
    PV("123"\0)
    PV("2"\0)
    PV("\377"\0)
    PV("\377"\0)
    UNDEF

    And chr(255) is *not* by definition a y with two dots. That is only the case in some of the encodings supported by perl: cp1252, cp1254, cp1258, hp-roman8, iso-8859-1, iso-8859-9, iso-8859-14, iso-8859-15, iso-8859-16, and UTF-7. If you don't specify the encoding or (Lord forbid) *assume* any of those just listed, chr(255) can be any of the following:

      7bit-jis                       \xFF
      cp1006                         ﹽ      ARABIC SHADDA MEDIAL FORM
      cp1026                                APPLICATION PROGRAM COMMAND
      cp1047                                APPLICATION PROGRAM COMMAND
      cp1250                         ˙      DOT ABOVE
      cp1251                         я      CYRILLIC SMALL LETTER YA
      cp1252                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      cp1254                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      cp1256                         ے      ARABIC LETTER YEH BARREE
      cp1257                         ˙      DOT ABOVE
      cp1258                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      cp37                                  APPLICATION PROGRAM COMMAND
      cp424                                 APPLICATION PROGRAM COMMAND
      cp437                                NO-BREAK SPACE
      cp500                                 APPLICATION PROGRAM COMMAND
      cp737                                NO-BREAK SPACE
      cp775                                NO-BREAK SPACE
      cp850                                NO-BREAK SPACE
      cp852                                NO-BREAK SPACE
      cp855                                NO-BREAK SPACE
      cp856                                NO-BREAK SPACE
      cp857                                NO-BREAK SPACE
      cp858                                NO-BREAK SPACE
      cp860                                NO-BREAK SPACE
      cp861                                NO-BREAK SPACE
      cp862                                NO-BREAK SPACE
      cp863                                NO-BREAK SPACE
      cp865                                NO-BREAK SPACE
      cp866                                NO-BREAK SPACE
      cp869                                NO-BREAK SPACE
      cp875                                 APPLICATION PROGRAM COMMAND
      cp932                          
      cp936                          
      cp949                          
      cp950                          
      gsm0338                        ?      QUESTION MARK
      hp-roman8                      ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      iso-2022-jp                    \xFF
      iso-2022-jp-1                  \xFF
      iso-2022-kr                    \xFF
      iso-8859-1                     ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      iso-8859-10                    ĸ      LATIN SMALL LETTER KRA
      iso-8859-13                    ’      RIGHT SINGLE QUOTATION MARK
      iso-8859-14                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      iso-8859-15                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      iso-8859-16                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      iso-8859-2                     ˙      DOT ABOVE
      iso-8859-3                     ˙      DOT ABOVE
      iso-8859-4                     ˙      DOT ABOVE
      iso-8859-5                     џ      CYRILLIC SMALL LETTER DZHE
      iso-8859-9                     ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      koi8-f                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
      koi8-r                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
      koi8-u                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
      MacArabic                      ے      ARABIC LETTER YEH BARREE
      MacCentralEurRoman             ˇ      CARON
      MacChineseSimp                 …      HORIZONTAL ELLIPSIS
      MacChineseTrad                 …      HORIZONTAL ELLIPSIS
      MacCroatian                    ˇ      CARON
      MacCyrillic                    €      EURO SIGN
      MacFarsi                       ے      ARABIC LETTER YEH BARREE
      MacGreek                              SOFT HYPHEN
      MacHebrew                      |      VERTICAL LINE
      MacIcelandic                   ˇ      CARON
      MacJapanese                    
      MacKorean                      
      MacRoman                       ˇ      CARON
      MacRomanian                    ˇ      CARON
      MacRumanian                    ˇ      CARON
      MacSami                        ǩ      LATIN SMALL LETTER K WITH CARON
      MacTurkish                     ˇ      CARON
      posix-bc                       ~      TILDE
      UTF-7                          ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
      viscii                         Ữ      LATIN CAPITAL LETTER U WITH HORN AND TILDE
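A few rows of a table like the one above can be reproduced with a short sketch along these lines. It assumes the core Encode and charnames modules; codepoint_for_byte_ff is a hypothetical helper name, and this is not necessarily how the original table was produced.

```perl
use strict;
use warnings;
use Encode qw( decode );
use charnames ();

# Decode the single byte 0xFF under a given encoding and return the
# resulting code point (hypothetical helper, not from the thread).
sub codepoint_for_byte_ff {
    my ($encoding) = @_;
    my $bytes = "\xFF";
    return ord( decode( $encoding, $bytes ) );
}

# Print the code point and its Unicode name for a handful of encodings.
for my $encoding (qw( iso-8859-1 cp1251 koi8-r cp437 )) {
    my $cp = codepoint_for_byte_ff($encoding);
    printf "%-12s U+%04X  %s\n", $encoding, $cp,
        charnames::viacode($cp) // '(no name)';
}
```

The same byte decodes to four different characters here, which is the point: a byte value only becomes a character once an encoding is chosen.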
    

    Enjoy, Have FUN! H.Merijn

      Touché.

      $ perl -MDP -we'my@x=("",undef,"1","123","2","\xff","\x{00ff}");DPeek for map{$_->[1]}sort{$a->[0]cmp$b->[0]}map{[$_//"\x{1ffff}",$_]}@x'
      Not sure what you're trying to prove with this. Do you think U+1ffff is the biggest Unicode character?
      And chr(255) is *not* per definition an y with two dots.
      Which is exactly why I said I was *hoping* it would get replaced with something less than 255. People around here need less sensitive nerd-rage triggers.
Re^4: Sort undef
by marinersk (Priest) on Jun 14, 2017 at 08:17 UTC

    It would seem that deaccent() would modify the data to a sub-255 value, leaving a single 255 in the Schwartzian Transform as a viable sort max key -- as noted above, this should be proven before being deployed.

    As to your other note, Unicode characters "above 255" are actually multi-byte sequences whose individual bytes still cannot exceed the architectural limitation of chr(255), so I question that perceived vulnerability.

      Perl strings are sequences of Unicode code points, not sequences of bytes. (Well, I think they're stored internally as plain bytes if possible.)
      use Encode qw( encode_utf8 );
      my $x = chr(1 << 63);
      print length($x), "\n";
      print length(encode_utf8($x)), "\n";
      print "yep\n" if $x gt chr(255);
      Output:
      Use of code point 0x8000000000000000 is deprecated; the permissible max is 0x7FFFFFFFFFFFFFFF at foo line 2.
      1
      13
      yep

        Going back to my attitude when I was a C programmer: It pays to know how your compiler thinks. (Needs adjustment for application to modern use of Perl, but the sentiment is the same.)

        Adding/replacing these three lines into my original script above:

        use Encode qw( encode_utf8 );
        my $x = chr(1 << 63);
        my @Unsorted = ( 'Dog', 'Cat', 'Bird', undef, $x, 'Elephant', undef, 'Lizard' );

        Yields:

        S:\Steve\Dev\PerlMonks\P-2017-06-12@0734-sort-undef>perl .\sort011.pl
        -------------------------------------------------------------------------------
        Original:
        -------------------------------------------------------------------------------
        Dog
        Cat
        Bird
        (undef)
        Wide character in print at .\sort011.pl line 58.
        Elephant
        (undef)
        Lizard
        -------------------------------------------------------------------------------
        -------------------------------------------------------------------------------
        Custom Sort:
        -------------------------------------------------------------------------------
        Bird
        Cat
        Dog
        Elephant
        Lizard
        (undef)
        (undef)
        Wide character in print at .\sort011.pl line 58.
        -------------------------------------------------------------------------------
        S:\Steve\Dev\PerlMonks\P-2017-06-12@0734-sort-undef>

        A string of chr(255) bytes longer than the longest item in the original array still fails to sort to the bottom; knowing that Unicode characters are stored differently than old-fashioned ASCII strings empowers the Perl programmer to make a better choice.
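One possible "better choice" is to drop the sentinel key altogether and have the comparator test for undef directly. The following is only a sketch under that assumption; sort_undef_last is a hypothetical name and is not taken from the script discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: sorts defined strings with cmp and puts every
# undef after them, without relying on any "largest possible" key.
sub sort_undef_last {
    return sort {
          !defined($a) ? ( defined($b) ? 1 : 0 )
        : !defined($b) ? -1
        :                $a cmp $b
    } @_;
}

# chr(0x2603) is a character above 255; it still sorts before the undefs.
my @sorted = sort_undef_last( 'Dog', undef, chr(0x2603), 'Cat', undef );
```

Because no maximum key is involved, it does not matter how large a code point appears in the data.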

        Thank you for the information!

        I'd upvote the post, but there isn't any point, as it's Anonymous Monk.