G'day vrk,
"Besides, Unicode codepoints often aren't ordered alphabetically in any script, so you wouldn't get a sorted (collated) sequence even if it did."
[Note:
There's no intended pedantry here; however, as I understand your statement,
I believe you mean "characters", not "codepoints".
On that basis, I don't disagree with your statement, at all.
The distinction is important for the remainder of my response.]
The core module Unicode::Collate
can be used for sorting Unicode characters.
$ perl -E 'say for sort qw{z é a}'
a
z
é
$ perl -Mutf8 -CS -MUnicode::Collate -E 'say for Unicode::Collate::->new->sort(qw{z é a})'
a
é
z
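The code points are numerical values: a numerical sort is required for these.
(The utf8 pragma is needed here so that "é" in the source is read as the single character U+00E9, not as two UTF-8 bytes.)
$ perl -Mutf8 -E 'say for sort map { ord } qw{z é a}'
122
233
97
$ perl -Mutf8 -E 'say for sort { $a <=> $b } map { ord } qw{z é a}'
97
122
233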
Code points are often presented as hexadecimal strings (that may have a leading "U+").
When dealing with these, it can be useful to first convert them to some canonical format.
As the code point range is 0 .. 0x10ffff, an sprintf format including
"%06x" or "%06X" handles all cases.
$ perl -Mutf8 -E 'say sprintf "U+%06X", $_ for map { ord } qw{z é a}'
U+00007A
U+0000E9
U+000061