http://qs321.pair.com?node_id=1192648


in reply to Re: Sort undef
in thread Sort undef

That's bloody brilliant.

One question, though. To my eye it looks to be vulnerable to the case where the original list has at least one element which starts with two or more chr(255) characters and at least one element that is undef.

Or am I missing something?
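
To make the concern concrete, here is a minimal sketch. It assumes the parent post keys undef as a single chr(255) sentinel inside a Schwartzian Transform (the parent code is not shown here, so that is a guess), and the list contents are made up for illustration. The second sort is one possible sentinel-free alternative, not the parent's method.

```perl
use strict;
use warnings;

# Hypothetical input: one undef plus a string made of two chr(255) characters.
my @list = ("b", undef, "\xff\xff", "a");

# Sentinel approach (assumed): undef is keyed as a single chr(255), so any
# longer string that starts with chr(255) sorts after it.
my @by_sentinel =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $_ // "\xff", $_ ] } @list;
# Order: "a", "b", undef, "\xff\xff"   (undef is not last)

# Sentinel-free approach: compare definedness first, then the strings.
my @by_definedness = sort {
    defined($b) <=> defined($a)          # defined values before undef
        or ($a // '') cmp ($b // '')     # tie-break: both defined or both undef
} @list;
# Order: "a", "b", "\xff\xff", undef
```

Comparing definedness first means no string value can ever collide with the stand-in for undef, whatever characters the data contains.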

Re^3: Sort undef
by Anonymous Monk on Jun 13, 2017 at 07:12 UTC
    I'm hoping deaccent will change chr(255) (y with two dots) to a plain y. It also fails if there are strings starting with Unicode characters above 255. Handling Unicode in full generality is a huge pain, so I punted. You probably need something like Unicode::Collate to do it right.

      Use Unicode. Perl is quite good at that.

      $ perl -MDP -we'my@x=("",undef,"1","123","2","\xff","\x{00ff}");DPeek for map{$_->[1]}sort{$a->[0]cmp$b->[0]}map{[$_//"\x{1ffff}",$_]}@x'
      PV(""\0)
      PV("1"\0)
      PV("123"\0)
      PV("2"\0)
      PV("\377"\0)
      PV("\377"\0)
      UNDEF

      And chr(255) is *not* per definition an y with two dots. That is only the case in some of the encodings supported by perl (cp1252, cp1254, cp1258, hp-roman8, iso-8859-1, iso-8859-9, iso-8859-14, iso-8859-15, iso-8859-16, and UTF-7). If you don't specify the encoding or (lord forbid) *assume* any of the ones just listed, chr(255) could be any of:

        7bit-jis                       \xFF
        cp1006                         ﹽ      ARABIC SHADDA MEDIAL FORM
        cp1026                                APPLICATION PROGRAM COMMAND
        cp1047                                APPLICATION PROGRAM COMMAND
        cp1250                         ˙      DOT ABOVE
        cp1251                         я      CYRILLIC SMALL LETTER YA
        cp1252                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        cp1254                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        cp1256                         ے      ARABIC LETTER YEH BARREE
        cp1257                         ˙      DOT ABOVE
        cp1258                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        cp37                                  APPLICATION PROGRAM COMMAND
        cp424                                 APPLICATION PROGRAM COMMAND
        cp437                                 NO-BREAK SPACE
        cp500                                 APPLICATION PROGRAM COMMAND
        cp737                                 NO-BREAK SPACE
        cp775                                 NO-BREAK SPACE
        cp850                                 NO-BREAK SPACE
        cp852                                 NO-BREAK SPACE
        cp855                                 NO-BREAK SPACE
        cp856                                 NO-BREAK SPACE
        cp857                                 NO-BREAK SPACE
        cp858                                 NO-BREAK SPACE
        cp860                                 NO-BREAK SPACE
        cp861                                 NO-BREAK SPACE
        cp862                                 NO-BREAK SPACE
        cp863                                 NO-BREAK SPACE
        cp865                                 NO-BREAK SPACE
        cp866                                 NO-BREAK SPACE
        cp869                                 NO-BREAK SPACE
        cp875                                 APPLICATION PROGRAM COMMAND
        cp932                          
        cp936                          
        cp949                          
        cp950                          
        gsm0338                        ?      QUESTION MARK
        hp-roman8                      ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        iso-2022-jp                    \xFF
        iso-2022-jp-1                  \xFF
        iso-2022-kr                    \xFF
        iso-8859-1                     ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        iso-8859-10                    ĸ      LATIN SMALL LETTER KRA
        iso-8859-13                    ’      RIGHT SINGLE QUOTATION MARK
        iso-8859-14                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        iso-8859-15                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        iso-8859-16                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        iso-8859-2                     ˙      DOT ABOVE
        iso-8859-3                     ˙      DOT ABOVE
        iso-8859-4                     ˙      DOT ABOVE
        iso-8859-5                     џ      CYRILLIC SMALL LETTER DZHE
        iso-8859-9                     ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        koi8-f                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
        koi8-r                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
        koi8-u                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
        MacArabic                      ے      ARABIC LETTER YEH BARREE
        MacCentralEurRoman             ˇ      CARON
        MacChineseSimp                 …      HORIZONTAL ELLIPSIS
        MacChineseTrad                 …      HORIZONTAL ELLIPSIS
        MacCroatian                    ˇ      CARON
        MacCyrillic                    €      EURO SIGN
        MacFarsi                       ے      ARABIC LETTER YEH BARREE
        MacGreek                              SOFT HYPHEN
        MacHebrew                      |      VERTICAL LINE
        MacIcelandic                   ˇ      CARON
        MacJapanese                    …
        MacKorean                      …
        MacRoman                       ˇ      CARON
        MacRomanian                    ˇ      CARON
        MacRumanian                    ˇ      CARON
        MacSami                        ǩ      LATIN SMALL LETTER K WITH CARON
        MacTurkish                     ˇ      CARON
        posix-bc                       ~      TILDE
        UTF-7                          ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
        viscii                         Ữ      LATIN CAPITAL LETTER U WITH HORN AND TILDE
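
A few rows of the table above can be reproduced with something like the following sketch. It is not the script that generated the table (that one isn't shown); it just decodes the single byte 0xFF with Encode's decode() and looks up the character name with charnames::viacode(). The four encodings are an arbitrary sample picked from the table.

```perl
use strict;
use warnings;
use Encode    qw(decode);
use charnames ();

# A small, arbitrary selection of encodings taken from the table.
my @encodings = qw(iso-8859-1 cp1251 cp437 koi8-r);

my %name_for;
for my $encoding (@encodings) {
    my $octets = "\xFF";                        # fresh copy of the byte per call
    my $char   = decode($encoding, $octets);    # byte 0xFF -> one character
    $name_for{$encoding} = charnames::viacode(ord($char));
    printf "%-12s U+%04X  %s\n", $encoding, ord($char), $name_for{$encoding};
}
```

The same byte comes back as a different code point for each encoding, which is the whole point: a raw chr(255) only means "y with diaeresis" once you have said which encoding you are in.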
      

      Enjoy, Have FUN! H.Merijn

        Touché.

        $ perl -MDP -we'my@x=("",undef,"1","123","2","\xff","\x{00ff}");DPeek for map{$_->[1]}sort{$a->[0]cmp$b->[0]}map{[$_//"\x{1ffff}",$_]}@x'
        Not sure what you're trying to prove with this. Do you think U+1ffff is the biggest Unicode character?
        And chr(255) is *not* per definition an y with two dots.
        Which is exactly why I said I was *hoping* it would get replaced with something less than 255. People around here need less sensitive nerd-rage triggers.
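
On the U+1FFFF question, a minimal sketch of why that sentinel is not a true maximum: U+10FFFF is the highest Unicode code point, and it compares greater than "\x{1ffff}". The list is invented for illustration; the map/sort/map is the same shape as the one-liner quoted above, just without the DP module.

```perl
use strict;
use warnings;

my $sentinel = "\x{1ffff}";                     # stand-in for undef, as in the one-liner
my @list     = ("a", undef, chr(0x10FFFF));     # hypothetical data with a higher code point

my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $_ // $sentinel, $_ ] } @list;

# Order: "a", undef, chr(0x10FFFF)   (undef no longer sorts last)
```

So the sentinel only works as long as no element starts with a code point above U+1FFFF.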

      It would seem that deaccent() would modify the data to a sub-255 value, leaving a single chr(255) in the Schwartzian Transform as a viable sort max key. As noted above, this should be proven before it is deployed.

      As to your other note, Unicode characters "above 255" are actually multi-byte sequences whose individual bytes still cannot exceed the architectural limitation of chr(255) so I question that perceived vulnerability.

        Perl strings are sequences of Unicode code points, not sequences of bytes. (Well, I think they're stored internally as plain bytes if possible.)
        use Encode qw( encode_utf8 );
        my $x = chr(1 << 63);
        print length($x), "\n";
        print length(encode_utf8($x)), "\n";
        print "yep\n" if $x gt chr(255);
        Output:
        Use of code point 0x8000000000000000 is deprecated; the permissible max is 0x7FFFFFFFFFFFFFFF at foo line 2.
        1
        13
        yep
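
The same point can be made with an ordinary code point, which avoids the deprecation warning. This is a sketch along the same lines as the snippet above, with U+263A chosen arbitrarily as an example of a character above 255; the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $char  = chr(0x263A);           # WHITE SMILING FACE, a code point above 255
my $bytes = encode_utf8($char);    # its UTF-8 encoding, bytes E2 98 BA

my $char_length = length($char);   # counted in characters
my $byte_length = length($bytes);  # counted in bytes
my $beats_255   = $char gt chr(255) ? 1 : 0;

print "$char_length $byte_length $beats_255\n";   # prints "1 3 1"
```

The string holds one character that compares greater than chr(255); the three bytes only show up once it is explicitly encoded.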