PerlMonks  

substr on UTF-8 strings

by rdiez (Acolyte)
on Jun 24, 2020 at 12:00 UTC ( #11118415=perlquestion )

rdiez has asked for the wisdom of the Perl Monks concerning the following question:

I need to remove any trailing directory separators (the slash character '/', possibly several in a row) from a file path. The Perl string with the file path was originally read from a UTF-8 text file and is internally flagged as UTF-8. This is only an example; I am doing more string operations like that.

I would have thought that doing a substr() on a UTF-8 string would respect the UTF-8 flag. However, the string I get back is always flagged as "native/raw bytes". Well, at least with the test strings I am using, which admittedly are all ASCII at the moment. In any case, changing the UTF-8/native flag creates problems in my code later down the line.

Is this substr() behaviour expected? I have tested it with the Perl version v5.26.1 that comes with Ubuntu 18.04.4.

Replies are listed 'Best First'.
Re: substr on UTF-8 strings
by Corion (Pope) on Jun 24, 2020 at 12:11 UTC

    Your code should not look at the "UTF-8 flag" anyway, but always use decode() and encode() to decode the input from whatever input encoding it is in, and encode to whatever encoding the output expects.

    In the meantime, it sounds weird that substr would (re)set that flag in a way that causes errors in the subsequent strings. Can you maybe post a reduced example (ideally with the source strings escaped)?

    Something like the following, except of course that it fails?

    #!perl
    use strict;
    use warnings;
    use Encode 'encode', 'decode';
    use charnames ':full';
    use Test::More;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;

    my $octets = "This is the raw input string, also containing an umlaut, in UTF-8 bytes: mot\xc3\x96rhead ... and some more text";
    my $expected = "This is the raw input string, also containing an umlaut, in UTF-8 bytes: mot\N{LATIN CAPITAL LETTER O WITH DIAERESIS}rhead ... and some more text";

    my $string = decode('UTF-8', $octets);
    is $string, $expected, "The decoded strings are identical (sanity check)";

    my $part = substr( $string, 73, 9 );
    is $part, "mot\N{LATIN CAPITAL LETTER O WITH DIAERESIS}rhead", "We snip the correct part"
        or diag Dumper [$part, "mot\N{LATIN CAPITAL LETTER O WITH DIAERESIS}rhead"];

    done_testing();
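As a complement to the test above, here is a minimal sketch of the full round trip that the reply recommends: decode at the input boundary, work on characters, encode at the output boundary. The helper name `first_chars` is hypothetical and not from the thread, and the input is assumed to be UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): return the first $count
# characters of a UTF-8 encoded octet string, again as UTF-8 octets.
sub first_chars {
    my ($octets, $count) = @_;    # work on a copy; decode() with a CHECK
                                  # argument may modify its input in place
    my $text = decode('UTF-8', $octets, Encode::FB_CROAK);  # octets -> characters
    my $part = substr($text, 0, $count);                     # counts characters, not bytes
    return encode('UTF-8', $part);                           # characters -> octets
}

# STDOUT has no encoding layer here, so we print octets, not characters.
print first_chars("mot\xc3\x96rhead", 4), "\n";
```

At no point does the code need to inspect the UTF-8 flag; the only thing it has to know is which encoding the input and the output use.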
Re: substr on UTF-8 strings
by haj (Curate) on Jun 24, 2020 at 14:44 UTC

    As others have pointed out, fiddling with the UTF-8 flag is a really bad idea.

    However, you might have encountered problems in your code later because of a known gotcha between Perl and file systems. The hardware of your file system doesn't know about encodings, it only knows bytes. So someone has to encode non-ASCII-characters in file names, and here's a place where Perl itself trips over the UTF-8 flag.

    In short: When you have a path name with the UTF-8 flag on, Perl will silently encode the path name as UTF-8 before passing it to the file system (or, more precisely, to the C library). This is particularly nasty for characters which can be expressed in Perl's default one-byte encoding, e.g. "ä". Depending on the history of an "ä" in Perl code, it is stored either as one byte, with the UTF-8 flag off, or as two bytes, with the UTF-8 flag on. So from two characters which compare "equal" to Perl, Perl creates two different path names.

    For substr and other Perl functions it is really, really best to leave the UTF-8 flag alone. The flag is "internal", which means its value is not documented nor guaranteed not to change with Perl versions.

    For the details, let's start with a demo program:

    use strict;
    use warnings;
    use File::Temp qw/tempdir/;
    use Test::More tests => 14;
    use Encode qw/decode/;

    my $ae = "\N{LATIN SMALL LETTER A WITH DIAERESIS}";
    is (length $ae, 1, "The a with diaeresis is one character to Perl,");
    ok (utf8::is_utf8($ae),
        " ...and Perl knows it to be UTF-8 (UTF-8 flag on),");
    my $ae2 = $ae;
    ok(utf8::downgrade($ae2), " ...and it can be downgraded");
    ok(! utf8::is_utf8($ae2),
       " ...so that is no longer stored as UTF-8 (UTF-8 flag off).");
    is ($ae,$ae2,"UTF-8 encoded and downgraded versions are equal.");

    note ("Now we are creating file(s), using the character as a filename.");
    my $tempdir = tempdir( CLEANUP => 1 )
        or BAIL_OUT "No tempdir, demo terminated.";
    for my $file ($ae,$ae2) {
        ok (open (my $out,'>',"$tempdir/$file"),
            "Opening two files with the same (?) name.")
            or BAIL_OUT "No file written, demo terminated.";
        close $out;
    }

    note ("Now reading the contents of the directory...");
    opendir (my $dh,$tempdir);
    my @files = sort { length $a <=> length $b }
                grep { /^[^.]/ }    # exclude . and ..
                readdir $dh;
    is (scalar @files, 2, "We apparently have created TWO files.");
    is (length $files[0],1,"The first filename is one byte long,");
    is (length $files[1],2,"The second filename is TWO bytes long.");
    ok(! utf8::is_utf8($files[0]),
       "The first filename is stored as bytes (UTF-8 flag off).");
    ok(! utf8::is_utf8($files[1]),
       "The second filename is ALSO stored as bytes (UTF-8 flag off).");
    is ($files[0],$ae,"The first filename is what we provided.");
    is (decode('UTF-8',$files[1],Encode::FB_CROAK),$ae,
        "The second filename needs to be decoded to match what we provided.");
    note("There is silent encoding of path names, but no silent decoding!");

    Stuff like this makes me think at least twice whether I really need non-ASCII characters in a path name. It gets even messier if you have files which have been created under some old OS versions before UTF-8 was a quasi-standard.
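One way to sidestep the gotcha demonstrated above is to encode path names explicitly before handing them to the file system, and to decode whatever comes back. The following is only a sketch under the assumption that the file system uses UTF-8 file names; the helper names `path_to_octets` and `octets_to_path` are hypothetical and not from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers, assuming UTF-8 file names on disk.
sub path_to_octets {
    my ($path) = @_;
    return encode('UTF-8', $path);    # characters -> octets for open(), opendir(), ...
}

sub octets_to_path {
    my ($octets) = @_;                # copy: decode() with CHECK may modify its input
    return decode('UTF-8', $octets, Encode::FB_CROAK);    # octets from readdir() -> characters
}

# The same character in both internal storage formats ...
my $ae_down = chr(0xE4);              # a with diaeresis, stored as one byte
my $ae_up   = $ae_down;
utf8::upgrade($ae_up);                # stored as two bytes, UTF-8 flag on

# ... is encoded to the same two octets, so both name the same file.
printf "%vX\n", path_to_octets($ae_down);    # C3.A4
printf "%vX\n", path_to_octets($ae_up);      # C3.A4
```

With this approach the string passed to open() always has a predictable byte sequence, whatever its history inside Perl.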

Re: substr on UTF-8 strings (updated)
by haukex (Bishop) on Jun 24, 2020 at 12:12 UTC
    However, the string I get back is always flagged as "native/raw bytes".

    Yes, I can confirm (code below), although only if the original string really is ASCII. Update: To be more clear: I can only confirm this in the case the original string is ASCII; otherwise the UTF-8 flag remains enabled. In the case of an ASCII string, I don't see how it not being flagged as UTF-8 causes problems? /Update

    In any case, changing the UTF-8/native flag creates problems in my code later on the line.

    Perhaps this is the issue we should look at - could you show an SSCCE of how a plain ASCII string without the UTF-8 flag is causing problems for you?

      I am finding Unicode support in Perl hard. Most of my strings are ASCII, so there usually is no trouble. But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

      So I have come up with an assert strategy: during development, I enable my "UTF-8 asserts", so that I verify that strings are flagged as native or as UTF-8 at the places where they should be. This has helped me prevent errors. And that is how I realised that substr() behaves differently.

      If I capture those trailing slashes with a regular expression, the (UTF-8/native) flag is preserved. I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.

      Say substr sees that all sliced characters are ASCII and sets the "native string" flag. Say my code slices some other path components, some of which do have Unicode characters, so that those strings remain flagged as UTF-8. Let's assume that all those strings are concatenated together afterwards.

      Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF-8, concatenation should be faster, shouldn't it?

      In any case, is there a good reason why substr should take a 'UTF-8 string' and return a 'native string'? I have heard that other routines do respect the flag.

        But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

        Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

        I verify that strings are flagged as native or as UTF-8 at the places where they should be.

        You should only be checking the UTF8 flag for debugging purposes when you find problems with your code, and not assuming what its state should be - it's an internal flag that can (and will) change across Perl versions.

        Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF8, concatenation should be faster, shouldn't it?

        I think you're worrying too much about the internals here. Perl generally does the right thing; you should only worry about it if you actually have problems with your code (write tests to check the input and output of your code), and you should only worry about speed if it becomes an issue for you.
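To illustrate the concatenation question, here is a small sketch with made-up example strings. It mixes a native ASCII string with a string that Perl has to store as UTF-8 internally, and then repeats the concatenation with an upgraded copy of the ASCII string. Concatenation itself does not warn; when a native string needs to be upgraded, its bytes are treated as Latin-1 code points.

```perl
use strict;
use warnings;

my $native = "dir/";              # ASCII only
my $wide   = "\x{2660}.txt";      # U+2660 does not fit in one byte per character
my $joined = $native . $wide;     # Perl reconciles the storage formats itself

my $upgraded = $native;
utf8::upgrade($upgraded);         # same characters, different internal storage
my $joined_again = $upgraded . $wide;

printf "length: %d\n", length $joined;                                # length: 9
printf "same result: %s\n", $joined eq $joined_again ? "yes" : "no";  # same result: yes
```

The result is the same string either way, which is why tests on input and output are more useful than assertions on the flag.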

        In general, for the best Unicode support, use the latest version of Perl (5.26 is pretty good, but think about upgrading using e.g. perlbrew), encode your source files as UTF-8, and start them off like this:

        #!/usr/bin/env perl
        use warnings;
        use 5.026;  # or higher, enables "unicode_strings" features etc.
        use utf8;   # source code is encoded with UTF-8
        use open qw/:std :utf8/;       # make UTF-8 default encoding, incl STDIN+OUT
        use warnings FATAL => 'utf8';  # optional

        And make sure to always specify the correct encoding when opening files ("open" Best Practices). If you have problems, feel free to post them here, see also my advice on that here.

        I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.

        See the core module File::Spec for how to do operations on filenames in a portable way.
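For the trailing-slash removal from the original question, a minimal sketch of both options might look like this. `File::Spec::Unix->canonpath` is used explicitly so the example behaves the same on every platform; note that it does more than strip trailing slashes (it also collapses repeated slashes and removes "./" components). The helper `strip_trailing_slashes` is hypothetical, not from the thread, and assumes '/' is the only separator.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Hypothetical helper: remove trailing slashes with a regex, but keep a
# lone "/" (the root directory) intact.
sub strip_trailing_slashes {
    my ($path) = @_;
    $path =~ s{(?<=.)/+\z}{}s;    # needs at least one character before the slashes
    return $path;
}

print strip_trailing_slashes("/usr/local///"), "\n";          # /usr/local
print strip_trailing_slashes("///"), "\n";                    # /
print File::Spec::Unix->canonpath("/usr/local///"), "\n";     # /usr/local
```

Both variants work on characters, so neither needs to know how the string is stored internally.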

        Update: Added "use warnings FATAL => 'utf8';"

        I am finding Unicode support in Perl hard.

        You have a lot of good practical replies already. I just want to say: Unicode support in Perl (in versions contemporary to the comparisons being made) is simply the best there is; pretty unambiguously and objectively last time I checked. Some of it is extremely confusing because that's just the bill of goods involved. Perl gives you ample tools to solve the problems.

        I am finding Unicode support in Perl hard. Most of my strings are ASCII, so there usually is no trouble. But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

        You are probably outputting decoded text (i.e. Unicode Code Points) to a file handle expecting encoded text (e.g. UTF-8). You can cause the encoding to happen automatically using

        use open ':std', ':encoding(UTF-8)';
Re: substr on UTF-8 strings
by ikegami (Pope) on Jun 26, 2020 at 19:31 UTC

    The Perl string with the file path was originally read from a UTF-8 text file and is internally flagged as UTF-8.

    So you say you have decoded text (aka a string of Unicode Code Points) which is stored using the UTF8=1 internal storage format?

    However, the string I get back is always flagged as "native/raw bytes".

    Perl is free to pick whatever internal storage format it wants.

    That said, I can't reproduce your claim. substr returns a string using the UTF8=1 format if that's the storage format used by the input string.

    $ perl -e'
       use Devel::Peek qw( Dump );
       my $s = "a\N{U+2660}";
       Dump($s);
       my $ss = substr($s, 0, 1);
       Dump($ss);
    '
    SV = PV(0x7fffd3496ca0) at 0x7fffd34c5a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x7fffd34c33a0 "a\342\231\240"\0 [UTF8 "a\x{2660}"]
      CUR = 4
      LEN = 10
      COW_REFCNT = 1
    SV = PV(0x7fffd3496d30) at 0x7fffd34c5ad0
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x7fffd34ad050 "a"\0 [UTF8 "a"]
      CUR = 1
      LEN = 10

      Ah, if the input string contains only characters in 00..7F, the returned string will use the UTF8=0 internal storage format.

      $ perl -e'
         use Devel::Peek qw( Dump );
         my $s = "ab";
         utf8::upgrade($s);  # Force UTF8=1 storage format.
         Dump($s);
         my $ss = substr($s, 0, 1);
         Dump($ss);
      '
      SV = PV(0x7fffdcf3dca0) at 0x7fffdcf6ca78
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x7fffdcf5dbf0 "ab"\0 [UTF8 "ab"]
        CUR = 2
        LEN = 10
      SV = PV(0x7fffdcf3dd30) at 0x7fffdcf6cb50
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x7fffdcf54050 "a"\0
        CUR = 1
        LEN = 10

      But like I said, it's Perl's prerogative to pick whatever internal storage format it wants.
