PerlMonks
Re^9: Mixed Unicode and ANSI string comparisons?
by graff (Chancellor) on Dec 16, 2015 at 04:16 UTC ( [id://1150457]=note )
Given that description, any sense of "sorting" seems pretty meaningless. Is there some other term that might better describe a sequencing of elements that is better than random?
If the overall data is (close to) what you describe, my first inclination would be to partition or segregate the data, by checking for the following conditions in the order shown:
Obviously, you have to start by using plain old binmode to read the input as raw bytes. In case you didn't look it up yet, the test for step 3 is: If the eval succeeds, it's utf8 data.
Default sorting within some of those partitions would make sense. For the others, it's not so much a matter of making sense, but rather just behaving in some consistent, predictable way. Note that group 2 could actually qualify as a subset of groups 3-5 - and that's a good reason to keep it distinct from those others.
Apart from that, if there's some desire to "classify" or "cluster" the non-ASCII, non-Unicode strings, statistics on byte ngrams can help a fair bit with that (but it remains a bit of a research task, with some training of models required for classification).
(updated to amend the conditions for set 5)
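For anyone who wants to try the eval-based utf8 test on raw bytes, here is a minimal sketch of one way to write it. It is not necessarily the exact snippet originally posted: the helper name is_valid_utf8 is made up for illustration, and the choice of Encode's strict 'UTF-8' decoder with FB_CROAK is an assumption about how the check would be done.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (name is an assumption, not from the original post):
# returns 1 if the raw byte string decodes cleanly as strict UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# Example byte strings, as they would arrive from a filehandle read after binmode.
for my $bytes ("caf\xC3\xA9", "caf\xE9") {
    printf "%s: %s\n", unpack('H*', $bytes),
        is_valid_utf8($bytes) ? 'utf8' : 'not utf8';
}
```

Pure ASCII strings also pass this check, so it only separates utf8 from other 8-bit data if the ASCII-only strings have already been pulled out beforehand.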