PerlMonks  

Re: Sort undef

by Anonymous Monk
on Jun 12, 2017 at 14:42 UTC ( [id://1192595]=note )


in reply to Sort undef

Surprised nobody's suggested the Schwartzian transform on this one. Combined with marinersk's sentinel technique, it's a nice, simple, efficient solution.
@$ResultsFinal =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ deaccent($_->[$OptOrderToDisplayTable]) // chr(255), $_ ] }
    @$ResultsFinal;
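For anyone who wants to try the idea in isolation, here is a minimal, self-contained sketch. The deaccent() below is only a hypothetical stand-in (the real one is not shown in this thread), and $col and @rows are made-up sample data playing the roles of $OptOrderToDisplayTable and @$ResultsFinal.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the thread's deaccent(): passes undef through
# and maps a handful of accented Latin-1 letters to plain ASCII.
sub deaccent {
    my ($s) = @_;
    return $s unless defined $s;
    (my $t = $s) =~ tr/\xE0\xE8\xE9\xFF/aeey/;
    return $t;
}

my $col  = 0;    # assumed column index, like $OptOrderToDisplayTable
my @rows = ( ['pear'], [undef], ["\xE9clair"], ['apple'] );

my @sorted =
    map  { $_->[1] }                                       # 3. unwrap the original row
    sort { $a->[0] cmp $b->[0] }                           # 2. compare the precomputed keys
    map  { [ deaccent($_->[$col]) // chr(255), $_ ] }      # 1. build [key, row]; undef gets the sentinel
    @rows;

# Report which rows ended up where, without printing raw Latin-1 bytes.
printf "%d: %s\n", $_, defined $sorted[$_][$col] ? 'defined' : '(undef)'
    for 0 .. $#sorted;
```

The point of the transform is that deaccent() runs once per row rather than once per comparison; the sentinel only decides where the undef rows land.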

Replies are listed 'Best First'.
Re^2: Sort undef
by marinersk (Priest) on Jun 13, 2017 at 03:46 UTC

    That's bloody brilliant.

    One question, though. To my eye it looks vulnerable to the case where the original list contains at least one element that starts with two or more chr(255) characters and at least one element that is undef.

    Or am I missing something?
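One way to sidestep the question, sketched here under the assumption that the undef entries should simply sort last: carry an explicit "is undef" flag in the key instead of relying on a sentinel character. The sample @values are made up for illustration, and this is not the code posted above.

```perl
use strict;
use warnings;

my @values = ( "b", undef, "\xFF\xFFz", "a", "\xFF" );

# Sentinel approach: undef is replaced by chr(255) for comparison only.
my @with_sentinel = sort { ($a // chr(255)) cmp ($b // chr(255)) } @values;

# Flag approach: key is [is_undef, string, original]; the flag is compared first,
# so no string value can ever outrank an undef.
my @with_flag =
    map  { $_->[2] }
    sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }
    map  { [ defined($_) ? 0 : 1, $_ // '', $_ ] }
    @values;

print "sentinel: last element is ", ( defined $with_sentinel[-1] ? "a string" : "undef" ), "\n";
print "flag:     last element is ", ( defined $with_flag[-1]     ? "a string" : "undef" ), "\n";
```

With the sentinel, "\xFF\xFFz" compares greater than chr(255) and so lands after the undef; with the flag, the undef is always last.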

      I'm hoping deaccent will change chr(255) (y with two dots) to a plain y. It also fails if there are strings starting with Unicode characters above 255. Handling Unicode in full generality is a huge pain, so I punted. You probably need something like Unicode::Collate to do it right.
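For reference, a rough sketch of what the Unicode::Collate route might look like. It assumes the strings are already decoded character strings, uses the default collation with no locale tailoring, and keeps undef last with an explicit flag rather than a sentinel. The sample @values are invented.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;    # default UCA ordering, no locale tailoring

# Sample data: e-acute, L-with-stroke (above 255), U-diaeresis, plus an undef.
my @values = ( "zebra", undef, "\x{E9}clair", "\x{141}\x{F3}d\x{17A}", "\x{DC}nder", "apple" );

my @sorted =
    map  { $_->[2] }
    sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }    # flag first, then binary sort key
    map  { [ defined($_) ? 0 : 1, $collator->getSortKey( $_ // '' ), $_ ] }
    @values;

binmode STDOUT, ':encoding(UTF-8)';
print join( ', ', map { $_ // '(undef)' } @sorted ), "\n";
```

getSortKey returns a binary string meant to be compared with cmp, so it slots into the Schwartzian transform in place of deaccent(); accented letters then sort next to their base letters instead of after "z".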

        Use Unicode. Perl is quite good at that.

        $ perl -MDP -we'my@x=("",undef,"1","123","2","\xff","\x{00ff}");DPeek for map{$_->[1]}sort{$a->[0]cmp$b->[0]}map{[$_//"\x{1ffff}",$_]}@x'
        PV(""\0)
        PV("1"\0)
        PV("123"\0)
        PV("2"\0)
        PV("\377"\0)
        PV("\377"\0)
        UNDEF

        And chr(255) is *not* by definition a y with two dots. That is only the case in these encodings supported by perl: cp1252, cp1254, cp1258, hp-roman8, iso-8859-1, iso-8859-9, iso-8859-14, iso-8859-15, iso-8859-16, and UTF-7. If you don't specify the encoding or (Lord forbid) *assume* one of those just listed, chr(255) can be any of the following:

          7bit-jis                       \xFF
          cp1006                         ﹽ      ARABIC SHADDA MEDIAL FORM
          cp1026                                APPLICATION PROGRAM COMMAND
          cp1047                                APPLICATION PROGRAM COMMAND
          cp1250                         ˙      DOT ABOVE
          cp1251                         я      CYRILLIC SMALL LETTER YA
          cp1252                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          cp1254                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          cp1256                         ے      ARABIC LETTER YEH BARREE
          cp1257                         ˙      DOT ABOVE
          cp1258                         ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          cp37                                  APPLICATION PROGRAM COMMAND
          cp424                                 APPLICATION PROGRAM COMMAND
          cp437                                 NO-BREAK SPACE
          cp500                                 APPLICATION PROGRAM COMMAND
          cp737                                 NO-BREAK SPACE
          cp775                                 NO-BREAK SPACE
          cp850                                 NO-BREAK SPACE
          cp852                                 NO-BREAK SPACE
          cp855                                 NO-BREAK SPACE
          cp856                                 NO-BREAK SPACE
          cp857                                 NO-BREAK SPACE
          cp858                                 NO-BREAK SPACE
          cp860                                 NO-BREAK SPACE
          cp861                                 NO-BREAK SPACE
          cp862                                 NO-BREAK SPACE
          cp863                                 NO-BREAK SPACE
          cp865                                 NO-BREAK SPACE
          cp866                                 NO-BREAK SPACE
          cp869                                 NO-BREAK SPACE
          cp875                                 APPLICATION PROGRAM COMMAND
          cp932                          
          cp936                          
          cp949                          
          cp950                          
          gsm0338                        ?      QUESTION MARK
          hp-roman8                      ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          iso-2022-jp                    \xFF
          iso-2022-jp-1                  \xFF
          iso-2022-kr                    \xFF
          iso-8859-1                     ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          iso-8859-10                    ĸ      LATIN SMALL LETTER KRA
          iso-8859-13                    ’      RIGHT SINGLE QUOTATION MARK
          iso-8859-14                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          iso-8859-15                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          iso-8859-16                    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          iso-8859-2                     ˙      DOT ABOVE
          iso-8859-3                     ˙      DOT ABOVE
          iso-8859-4                     ˙      DOT ABOVE
          iso-8859-5                     џ      CYRILLIC SMALL LETTER DZHE
          iso-8859-9                     ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          koi8-f                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
          koi8-r                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
          koi8-u                         Ъ      CYRILLIC CAPITAL LETTER HARD SIGN
          MacArabic                      ے      ARABIC LETTER YEH BARREE
          MacCentralEurRoman             ˇ      CARON
          MacChineseSimp                 …      HORIZONTAL ELLIPSIS
          MacChineseTrad                 …      HORIZONTAL ELLIPSIS
          MacCroatian                    ˇ      CARON
          MacCyrillic                    €      EURO SIGN
          MacFarsi                       ے      ARABIC LETTER YEH BARREE
          MacGreek                              SOFT HYPHEN
          MacHebrew                      |      VERTICAL LINE
          MacIcelandic                   ˇ      CARON
          MacJapanese                    …
          MacKorean                      …
          MacRoman                       ˇ      CARON
          MacRomanian                    ˇ      CARON
          MacRumanian                    ˇ      CARON
          MacSami                        ǩ      LATIN SMALL LETTER K WITH CARON
          MacTurkish                     ˇ      CARON
          posix-bc                       ~      TILDE
          UTF-7                          ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
          viscii                         Ữ      LATIN CAPITAL LETTER U WITH HORN AND TILDE
        

        Enjoy, Have FUN! H.Merijn
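A few rows of the table above can be spot-checked with the core Encode module. This is only a sketch; the five encodings are an arbitrary sample, and %decoded is just a name chosen here to collect the results.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

my %decoded;    # encoding name => code point that byte 0xFF decodes to

for my $enc (qw(iso-8859-1 cp1251 koi8-r iso-8859-5 cp437)) {
    my $char = decode( $enc, "\xFF" );    # interpret the single byte 0xFF in this encoding
    my $cp   = ord $char;
    $decoded{$enc} = $cp;
    printf "%-12s U+%04X  %s\n", $enc, $cp, charnames::viacode($cp) // '(unnamed)';
}
```

The same byte comes back as a different character for each encoding, which is why the meaning of chr(255) depends on how the data was decoded.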

        It would seem that deaccent() would modify the data to a sub-255 value, leaving a single 255 in the Schwartzian transform as a viable maximum sort key -- as noted above, this should be proven before it is deployed.

        As to your other note, Unicode characters "above 255" are actually multi-byte sequences whose individual bytes still cannot exceed the architectural limit of chr(255), so I question that perceived vulnerability.
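Whether a character above 255 can outrank the chr(255) sentinel seems to depend on whether the strings have been decoded. A small sketch that compares both cases, using U+0100 as an arbitrary example:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $sentinel = chr(255);
my $wide     = "\x{100}";    # LATIN CAPITAL LETTER A WITH MACRON, a single character

# Decoded character string: cmp compares code points, and 256 > 255,
# so this string sorts *after* the sentinel.
my $as_chars = $wide cmp $sentinel;

# Undecoded UTF-8 octets: U+0100 becomes the two bytes C4 80, both below FF,
# so this string sorts *before* the sentinel.
my $octets   = encode( 'UTF-8', $wide );
my $as_bytes = $octets cmp $sentinel;

printf "as characters: %d, as UTF-8 octets: %d\n", $as_chars, $as_bytes;
```

So the sentinel holds for raw UTF-8 octets, but once the data has been decoded into Perl character strings a code point above 255 compares greater than chr(255).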
