in reply to Re^2: Sort undef
in thread Sort undef
Re^4: Sort undef
by Tux (Canon) on Jun 14, 2017 at 09:26 UTC
Use Unicode. Perl is quite good at that.
And chr(255) is *not* by definition a y with two dots. That is only the case in some of the encodings supported by perl: cp1252, cp1254, cp1258, hp-roman8, iso-8859-1, iso-8859-9, iso-8859-14, iso-8859-15, iso-8859-16, and UTF-7. If you don't specify the encoding or (lord forbid) *assume* any of the just listed, this is what chr(255) is in each encoding:
7bit-jis            \xFF
cp1006              ﹽ ARABIC SHADDA MEDIAL FORM
cp1026                APPLICATION PROGRAM COMMAND
cp1047                APPLICATION PROGRAM COMMAND
cp1250              ˙ DOT ABOVE
cp1251              я CYRILLIC SMALL LETTER YA
cp1252              ÿ LATIN SMALL LETTER Y WITH DIAERESIS
cp1254              ÿ LATIN SMALL LETTER Y WITH DIAERESIS
cp1256              ے ARABIC LETTER YEH BARREE
cp1257              ˙ DOT ABOVE
cp1258              ÿ LATIN SMALL LETTER Y WITH DIAERESIS
cp37                  APPLICATION PROGRAM COMMAND
cp424                 APPLICATION PROGRAM COMMAND
cp437                 NO-BREAK SPACE
cp500                 APPLICATION PROGRAM COMMAND
cp737                 NO-BREAK SPACE
cp775                 NO-BREAK SPACE
cp850                 NO-BREAK SPACE
cp852                 NO-BREAK SPACE
cp855                 NO-BREAK SPACE
cp856                 NO-BREAK SPACE
cp857                 NO-BREAK SPACE
cp858                 NO-BREAK SPACE
cp860                 NO-BREAK SPACE
cp861                 NO-BREAK SPACE
cp862                 NO-BREAK SPACE
cp863                 NO-BREAK SPACE
cp865                 NO-BREAK SPACE
cp866                 NO-BREAK SPACE
cp869                 NO-BREAK SPACE
cp875                 APPLICATION PROGRAM COMMAND
cp932
cp936
cp949
cp950
gsm0338             ? QUESTION MARK
hp-roman8           ÿ LATIN SMALL LETTER Y WITH DIAERESIS
iso-2022-jp         \xFF
iso-2022-jp-1       \xFF
iso-2022-kr         \xFF
iso-8859-1          ÿ LATIN SMALL LETTER Y WITH DIAERESIS
iso-8859-10         ĸ LATIN SMALL LETTER KRA
iso-8859-13         ’ RIGHT SINGLE QUOTATION MARK
iso-8859-14         ÿ LATIN SMALL LETTER Y WITH DIAERESIS
iso-8859-15         ÿ LATIN SMALL LETTER Y WITH DIAERESIS
iso-8859-16         ÿ LATIN SMALL LETTER Y WITH DIAERESIS
iso-8859-2          ˙ DOT ABOVE
iso-8859-3          ˙ DOT ABOVE
iso-8859-4          ˙ DOT ABOVE
iso-8859-5          џ CYRILLIC SMALL LETTER DZHE
iso-8859-9          ÿ LATIN SMALL LETTER Y WITH DIAERESIS
koi8-f              Ъ CYRILLIC CAPITAL LETTER HARD SIGN
koi8-r              Ъ CYRILLIC CAPITAL LETTER HARD SIGN
koi8-u              Ъ CYRILLIC CAPITAL LETTER HARD SIGN
MacArabic           ے ARABIC LETTER YEH BARREE
MacCentralEurRoman  ˇ CARON
MacChineseSimp      … HORIZONTAL ELLIPSIS
MacChineseTrad      … HORIZONTAL ELLIPSIS
MacCroatian         ˇ CARON
MacCyrillic         € EURO SIGN
MacFarsi            ے ARABIC LETTER YEH BARREE
MacGreek              SOFT HYPHEN
MacHebrew           | VERTICAL LINE
MacIcelandic        ˇ CARON
MacJapanese         …
MacKorean           …
MacRoman            ˇ CARON
MacRomanian         ˇ CARON
MacRumanian         ˇ CARON
MacSami             ǩ LATIN SMALL LETTER K WITH CARON
MacTurkish          ˇ CARON
posix-bc            ~ TILDE
UTF-7               ÿ LATIN SMALL LETTER Y WITH DIAERESIS
viscii              Ữ LATIN CAPITAL LETTER U WITH HORN AND TILDE
Enjoy, Have FUN! H.Merijn
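For anyone who wants to check a few rows of the list above themselves, here is a minimal sketch using the core Encode and charnames modules. The helper name describe_byte_ff and the choice of four encodings are mine, purely for illustration; this is not necessarily how the list above was produced.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Decode the single byte 0xFF under the named encoding and describe
# every resulting character as "U+XXXX NAME".
# (describe_byte_ff is a hypothetical helper name, not a library function.)
sub describe_byte_ff {
    my ($encoding) = @_;
    my $chars = decode($encoding, "\xFF");
    return join ", ", map {
        my $cp = ord($_);
        sprintf "U+%04X %s", $cp, charnames::viacode($cp) // "(no name)";
    } split //, $chars;
}

# A handful of single-byte encodings taken from the list above.
for my $enc (qw(iso-8859-1 cp1251 cp437 koi8-r)) {
    printf "%-12s %s\n", $enc, describe_byte_ff($enc);
}
```

Printing the code point and name rather than the character itself avoids "Wide character" warnings and terminal encoding surprises.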
by marinersk (Priest) on Jun 17, 2017 at 01:59 UTC
Touché.
by Anonymous Monk on Jun 17, 2017 at 04:03 UTC
$ perl -MDP -we'my@x=("",undef,"1","123","2","\xff","\x{00ff}");DPeek for map{$_->[1]}sort{$a->[0]cmp$b->[0]}map{[$_//"\x{1ffff}",$_]}@x'
Not sure what you're trying to prove with this. Do you think U+1ffff is the biggest Unicode character?
"And chr(255) is *not* by definition a y with two dots."
Which is exactly why I said I was *hoping* it would get replaced with something less than 255. People around here need less sensitive nerd-rage triggers.
Re^4: Sort undef
by marinersk (Priest) on Jun 14, 2017 at 08:17 UTC
It would seem that deaccent() would modify the data to a sub-255 value, leaving a single 255 in the Schwartzian Transform as a viable sort max key; as noted above, this should be proven before being deployed. As to your other note, Unicode characters "above 255" are actually multi-byte sequences whose individual bytes still cannot exceed the architectural limitation of chr(255), so I question that perceived vulnerability.
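A quick way to test the multi-byte argument above is to ask Perl directly. The sketch below assumes decoded character strings and no "use locale"; the helper name sorts_before is hypothetical. In that setting length, ord and cmp all work on code points, and the UTF-8 bytes used for storage are not what gets compared.

```perl
use strict;
use warnings;

# True if $x sorts strictly before $y under Perl's default cmp
# (code point order when "use locale" is not in effect).
# (sorts_before is a hypothetical helper name.)
sub sorts_before {
    my ($x, $y) = @_;
    return ($x cmp $y) < 0;
}

my $y_diaeresis = chr(255);     # U+00FF
my $a_macron    = "\x{100}";    # U+0100, the first code point above 255

# Each string is a single character as far as string operations go.
printf "length: %d and %d\n", length($y_diaeresis), length($a_macron);
printf "ord:    %d and %d\n", ord($y_diaeresis),    ord($a_macron);

print "chr(255) sorts before U+0100\n"
    if sorts_before($y_diaeresis, $a_macron);
```

So a chr(255) sentinel is only a safe "max key" while every real key stays at or below U+00FF.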
by Anonymous Monk on Jun 17, 2017 at 03:50 UTC
Output:
by marinersk (Priest) on Aug 15, 2017 at 10:48 UTC
Going back to my attitude when I was a C programmer: It pays to know how your compiler thinks. (Needs adjustment for application to modern use of Perl, but the sentiment is the same.) Adding/replacing these three lines into my original script above:
Yields:
A string of chr(255) bytes longer than the longest item in the original array still fails to sort to the bottom; knowing that Unicode characters are stored differently than old-fashioned ASCII strings empowers the Perl programmer to make a better choice. Thank you for the information! I'd upvote the post, but there isn't any point, as it's Anonymous Monk.
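One way to sidestep the sentinel question entirely is to test definedness in the comparator instead of substituting a "largest" string. This is only a sketch, not the script discussed above; sort_undef_last and the sample data are my own illustrative choices.

```perl
use strict;
use warnings;

# Sort strings ascending, placing undef entries last, with no sentinel value.
# (sort_undef_last is a hypothetical helper name.)
sub sort_undef_last {
    my @items = @_;
    return sort {
        # Defined values rank 0 and undef ranks 1, so undef always loses;
        # equal ranks fall through to an ordinary string comparison.
        (defined $a ? 0 : 1) <=> (defined $b ? 0 : 1)
            or ($a // '') cmp ($b // '')
    } @items;
}

my @data   = ("", undef, "1", "123", "2", "\xff", "\x{100}", "\x{1ffff}");
my @sorted = sort_undef_last(@data);

# Show each entry as a list of code points to avoid wide-character output.
for my $item (@sorted) {
    print defined $item
        ? "[" . join(" ", map { sprintf "U+%04X", ord($_) } split //, $item) . "]\n"
        : "undef\n";
}
```

Because undef never takes part in the string comparison, it does not matter how large the largest real key is, whether that is chr(255), U+1FFFF, or anything up to U+10FFFF.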