UTF-8 lexicographic string sort

rdiez has asked for the wisdom of the Perl Monks concerning the following question:

Hi all:

I am writing a new tool like rhash, only with the ability to update hashes. I just got tired of waiting for this bug to be fixed:

Update hash if file last modification date has changed
https://github.com/rhash/RHash/issues/107

I have looked around and, surprisingly, there is no hash/checksum tool that does that properly (that I could find).

I think I will be using File::Find to scan files and directories. The first tool version will be in Perl, but it may need to be rewritten later in C for performance reasons (or whatever).

Therefore, I want the "allfiles.checksums" file to list files and their checksums ordered in such a way that you can easily and consistently reimplement the filename sorting in any other language.

I have been reading question "Sorting utf-8" here:

https://www.perlmonks.org/?node_id=252806

And I also looked at Unicode::Collate and other Perl Unicode documentation.

It is all pretty complicated. I have come to the conclusion that the only safe way to implement this is to do a plain UTF-8 lexicographic string sort on the filenames. I know that humans will find the resulting order unintuitive, but I think I can consider the "allfiles.checksums" file an internal database. The script itself could offer options to list its contents with different locale collation orders, if anybody really cares.

How do I implement a pure UTF-8 lexicographic string sort in Perl?

I guess I first need to make sure that the filenames returned by File::Find are actually encoded in UTF-8, because Perl may choose some other internal string encoding. I hope that this is what utf8::upgrade is for.

And then I can use string comparison operators such as 'lt' or 'cmp' on those UTF-8 strings. Is that correct?

Thanks in advance,
rdiez

Re: UTF-8 lexicographic string sort by Corion (Patriarch) on Apr 23, 2020 at 11:01 UTC
To implement "UTF-8 lexicographic sorting", you merely have to read in the filenames as UTF-8 (or, when reading them from the filesystem via File::Find, use Encode::decode to convert them to Unicode). Note that the filesystem APIs don't know about UTF-8 or any filename encodings, so you will have to `encode` the filenames appropriately when talking to the filesystem. Perl will do the rest when you sort them. For example, the following code should do what you describe:

```perl
use strict;
use warnings;
use File::Find;
use Encode qw(decode encode);

my @found_files;
File::Find::find(sub {
    # Decode the raw filename octets (assumed to be UTF-8) into Perl characters
    push @found_files, decode('UTF-8', $File::Find::name);
}, '.');

# The default sort compares strings code point by code point
@found_files = sort @found_files;

for my $file (@found_files) {
    # Encode back to octets before handing the name to the filesystem
    my $fs_name = encode('UTF-8', $file);
    open my $fh, '<', $fs_name or die "Couldn't open '$file': $!";
}
```
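Since the goal is an order that is easy to reproduce in other languages, it may help to see that sorting decoded strings by code point and sorting the raw UTF-8 octets byte by byte give the same result. The following is only a small sketch: the sample names are made up for illustration, and it assumes that `use locale` is not in effect.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample names, written as UTF-8 octets
# (the form a filesystem would typically hand back).
my @octet_names = ("z.txt", "\xC3\xA9.txt", "a.txt", "\xE2\x82\xAC.txt", "B.txt");

# Order 1: decode to characters, sort by code point, encode again.
my @by_codepoint = map  { encode('UTF-8', $_) }
                   sort { $a cmp $b }
                   map  { decode('UTF-8', $_) } @octet_names;

# Order 2: sort the undecoded octets directly, byte by byte.
my @by_octet = sort { $a cmp $b } @octet_names;

# UTF-8 preserves code point order under byte-wise comparison,
# so both lists should be identical.
my $orders_agree = join("\0", @by_codepoint) eq join("\0", @by_octet);
print $orders_agree ? "orders agree\n" : "orders differ\n";
```

In other words, a later C rewrite could presumably get the same order from a plain byte-wise comparison of the UTF-8 names, without any Unicode library.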
Re^2: UTF-8 lexicographic string sort by rdiez (Acolyte) on Apr 23, 2020 at 12:02 UTC
I am not sure that your code is correct. Let us look at this snippet you suggested:

`decode('UTF-8', $File::Find::name)`

Let us look at the documentation for Encode::decode:

"This function returns the string that results from decoding the scalar value OCTETS, assumed to be a sequence of octets in ENCODING, into Perl's internal form."

Your code is therefore assuming that $File::Find::name is in UTF-8, but this may not be correct.
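One way to at least detect a wrong assumption is to decode strictly, so that malformed input raises an error instead of being silently replaced. This is a minimal sketch, not a complete solution: `decode_name_strict` is a hypothetical helper name, and the two sample names are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded name, or undef if the
# octets are not well-formed UTF-8.
sub decode_name_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps it from modifying $octets in place.
    my $chars = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $chars;
}

my $ok  = decode_name_strict("caf\xC3\xA9.txt");  # well-formed UTF-8
my $bad = decode_name_strict("caf\xE9.txt");      # Latin-1 octet, malformed as UTF-8

print defined $ok  ? "first name decodes\n"  : "first name is not UTF-8\n";
print defined $bad ? "second name decodes\n" : "second name is not UTF-8\n";
```

A name that passes this check is well-formed UTF-8, which does not prove it was meant as UTF-8, but a name that fails it certainly was not.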
Re^3: UTF-8 lexicographic string sort by Corion (Patriarch) on Apr 23, 2020 at 12:08 UTC
Finding the correct encoding for the filesystem is up to you. I'm not aware of any good way to find/know the encoding of the names in a filesystem, so you will have to apply your own knowledge there.
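If the encoding really cannot be known, one possible alternative is to avoid decoding altogether and treat the names as opaque octets from start to finish. The sketch below assumes that approach; the names are made up, one of them is deliberately not valid UTF-8, and an in-memory handle stands in for the real checksum file.

```perl
use strict;
use warnings;

# Hypothetical names as raw octets; the last one is not valid UTF-8.
my @names = ("b.txt", "\xC3\xA4.txt", "a.txt", "\xFF.bin");

# cmp on undecoded strings compares octet values
# (provided "use locale" is not in effect).
my @sorted = sort { $a cmp $b } @names;

# Write through a raw handle so the octets pass through unchanged.
my $listing = '';
open my $out, '>:raw', \$listing or die "Cannot open in-memory handle: $!";
print {$out} "$_\n" for @sorted;
close $out or die "Cannot close in-memory handle: $!";
```

This keeps the sort well defined even for names that are not valid in any known encoding, at the cost of leaving any human-readable display to a separate step.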
Re^4: UTF-8 lexicographic string sort by rdiez (Acolyte) on Apr 23, 2020 at 14:15 UTC
Re^5: UTF-8 lexicographic string sort by Corion (Patriarch) on Apr 23, 2020 at 14:22 UTC
Re^5: UTF-8 lexicographic string sort by haukex (Archbishop) on Apr 23, 2020 at 15:28 UTC