PerlMonks  

UTF-8 lexicographic string sort

by rdiez (Acolyte)
on Apr 23, 2020 at 09:38 UTC

rdiez has asked for the wisdom of the Perl Monks concerning the following question:

Hi all:

I am writing a new tool like rhash, only with the ability to update hashes. I just got tired of waiting for this bug to be fixed:

  Update hash if file last modification date has changed
  https://github.com/rhash/RHash/issues/107

I have looked around and, surprisingly, there is no hash/checksum tool that does that properly (that I could find).

I think I will be using File::Find to scan files and directories. The first tool version will be in Perl, but it may need to be rewritten later in C for performance reasons (or whatever).

Therefore, I want the "allfiles.checksums" file to list files and their checksums ordered in such a way that you can easily and consistently reimplement the filename sorting in any other language.

I have been reading question "Sorting utf-8" here:

  https://www.perlmonks.org/?node_id=252806

And I also looked at Unicode::Collate and other Perl Unicode documentation.

It is all pretty complicated. I have come to the conclusion that the only safe way to implement this is to do a plain UTF-8 lexicographic string sort on the filenames. I know that humans will not find the resulting order natural, but I think I can consider the "allfiles.checksums" file an internal database. The script itself could offer options to list its contents with different locale collation orders, if anybody really cares.

How do I implement a pure UTF-8 lexicographic string sort in Perl?

I guess I need to make sure first that the filenames returned by File::Find are actually encoded in UTF-8, because Perl may choose some other internal string encoding. I hope that this is what utf8::upgrade is for.

And then I can use the binary (byte-wise) string comparison operators 'lt' or 'cmp' on those UTF-8 strings. Is that correct?

Thanks in advance,
  rdiez

Replies are listed 'Best First'.
Re: UTF-8 lexicographic string sort
by Corion (Patriarch) on Apr 23, 2020 at 11:01 UTC

    To implement "UTF-8 lexicographic sorting", you merely have to read in the filenames as UTF-8 (or, when reading them from the filesystem via File::Find, use Encode::decode to convert them to Unicode). Note that the filesystem APIs don't know about UTF-8 or any filename encodings, so you will have to encode the filenames appropriately when talking to the filesystem. Perl will do the rest when you sort them. For example, the following code should do what you describe:

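    use strict;
    use warnings;
    use File::Find;
    use Encode qw(decode encode);   # encode is needed further down as well

    my @found_files;
    File::Find::find(sub {
        # File::Find hands back raw octets; decode them (assumed UTF-8 here)
        push @found_files, decode('UTF-8', $File::Find::name);
    }, '.');

    # The default sort compares strings code point by code point
    @found_files = sort @found_files;

    for my $file (@found_files) {
        # Encode back to octets before talking to the filesystem
        my $fs_name = encode('UTF-8', $file);
        open my $fh, '<', $fs_name
            or die "Couldn't open '$file': $!";
    }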
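
    Since the goal is an order that is easy to reimplement in another language, it may help to see that sorting decoded strings by code point gives the same order as sorting the raw UTF-8 octets byte by byte (which is what memcmp would do in C). The following is only a sketch: the sample names are made up, and it assumes that every name is well-formed UTF-8 and that "use locale" is not in effect.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Hypothetical sample names, written out as UTF-8 octets
    my @octets = (
        "b.txt",
        "\xC3\xA9.txt",        # U+00E9 followed by ".txt"
        "a.txt",
        "\xE2\x82\xAC.txt",    # U+20AC followed by ".txt"
        "Z.txt",
    );

    # 1) Sort the raw octets byte by byte, without decoding anything
    my @by_bytes = sort { $a cmp $b } @octets;

    # 2) Decode, sort by code point, then encode back to octets
    my @by_chars = map  { encode('UTF-8', $_) }
                   sort { $a cmp $b }
                   map  { decode('UTF-8', $_) } @octets;

    # For well-formed UTF-8 input the two orders agree
    print "same order\n" if "@by_bytes" eq "@by_chars";
    ```

    If that holds for your data, the checksum file could be sorted on the raw octets directly, and decoding would only matter for display.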

      I am not sure that your code is correct.

      Let us look at this snippet you suggested:

        decode('UTF-8', $File::Find::name)

      Let us look at the documentation for Encode::decode:

      This function returns the string that results from decoding the scalar value OCTETS, assumed to be a sequence of octets in ENCODING, into Perl's internal form.

      Your code is therefore assuming that $File::Find::name is in UTF-8, but this may not be correct.

        Finding the correct encoding for the filesystem is up to you.

        I'm not aware of any good way to find/know the encoding of the names in a filesystem, so you will have to apply your own knowledge there.
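
        One way to apply that knowledge without guessing silently is to decode strictly and to treat names that are not well-formed UTF-8 as a separate case. This is a minimal sketch, not a definitive solution: decode_name_strict is a hypothetical helper name, and the two sample names are made up. It relies on the CHECK argument of Encode::decode, where FB_CROAK makes malformed input fatal and LEAVE_SRC stops decode from modifying its argument.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode);

        # Hypothetical helper: returns the decoded name, or undef if
        # the octets are not well-formed UTF-8.
        sub decode_name_strict {
            my ($octets) = @_;
            my $name = eval {
                decode('UTF-8', $octets,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
            };
            return $name;
        }

        # Made-up samples: the first is UTF-8, the second is Latin-1
        for my $octets ("caf\xC3\xA9.txt", "caf\xE9.txt") {
            my $name = decode_name_strict($octets);
            print defined $name ? "valid UTF-8\n" : "not UTF-8\n";
        }
        ```

        What to do with the names that fail (skip them, report them, or try another encoding) still depends on what you know about the filesystem.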
