PerlMonks
UTF-8 lexicographic string sort
by rdiez (Acolyte) on Apr 23, 2020 at 09:38 UTC ( [id://11115938]=perlquestion )
rdiez has asked for the wisdom of the Perl Monks concerning the following question:

Hi all:

I am writing a new tool like rhash, only with the ability to update hashes. I just got tired of waiting for this bug to be fixed: Update hash if file last modification date has changed

I have looked around and, surprisingly, there is no hash/checksum tool that does that properly (that I could find).

I think I will be using File::Find to scan files and directories. The first tool version will be in Perl, but it may need to be rewritten later in C for performance reasons (or whatever). Therefore, I want the "allfiles.checksums" file to list files and their checksums ordered in such a way that you can easily and consistently reimplement the filename sorting in any other language.

I have been reading question "Sorting utf-8" here: https://www.perlmonks.org/?node_id=252806

And I also looked at Unicode::Collate and other Perl Unicode documentation. It is all pretty complicated.

I have come to the conclusion that the only safe way to implement this is to do a plain UTF-8 lexicographic string sort on the filenames. I know that humans will find the sort order not good, but I think I can consider the "allfiles.checksums" file an internal database. The script itself could offer options to list its contents with different locale collation orders, if anybody really cares.

How do I implement a pure UTF-8 lexicographic string sort in Perl?

I guess I need to make sure first that the filenames returned by File::Find are actually coded in UTF-8, because Perl may choose some other internal string encoding. I hope that this is what utf8::upgrade is for. And then I can use binary comparison operators '<' or 'cmp' on those UTF-8 strings. Is that correct?

Thanks in advance,
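For reference, a minimal sketch of one way the byte-wise ordering asked about above could be done. The helper name sort_by_utf8_bytes is hypothetical, and the sketch assumes the filenames have already been decoded into Perl character strings. It relies on three facts: in Perl '<' compares numerically, while 'cmp' and 'lt' compare strings; without "use locale", 'cmp' compares code point by code point (byte by byte for byte strings); and utf8::upgrade only changes the internal representation of a string, not its characters, whereas utf8::encode turns characters into UTF-8 bytes. If File::Find hands back raw bytes that are already UTF-8 (likely on Unix-like systems, but an assumption here), a plain sort { $a cmp $b } would already give the same order.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the given names ordered by their UTF-8
# byte sequences. Assumes each name is a decoded character string.
sub sort_by_utf8_bytes {
    my @names = @_;
    return map  { $_->[1] }                    # keep the original name
           sort { $a->[0] cmp $b->[0] }        # byte-wise comparison (no "use locale")
           map  {
               my $bytes = $_;                 # copy, utf8::encode works in place
               utf8::encode($bytes);           # characters -> UTF-8 bytes
               [ $bytes, $_ ];
           } @names;
}

# Usage: prints Z, a, z, é, € (one per line).
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for sort_by_utf8_bytes("z", "\x{e9}", "a", "\x{20ac}", "Z");
```

Because UTF-8 preserves code point order, the same result can be reproduced in C by comparing the UTF-8 byte strings as unsigned bytes, which fits the goal of an ordering that is easy to reimplement elsewhere.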