Re^2: substr on UTF-8 strings

by rdiez (Acolyte)
on Jun 24, 2020 at 12:32 UTC


in reply to Re: substr on UTF-8 strings (updated)
in thread substr on UTF-8 strings

I am finding Unicode support in Perl hard. Most of my strings are ASCII, so there usually is no trouble. But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

So I have come up with an assert strategy: during development, I enable my "UTF-8 asserts", so that I verify that strings are flagged as native or as UTF-8 at the places where they should be. This has helped me prevent errors. And that is how I realised that substr() treats the UTF-8 flag differently from other string operations.
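
For illustration, a minimal sketch of such an assert, built on the core utf8::is_utf8 (the ENABLE_UTF8_ASSERTS switch and the helper are hypothetical names of mine):

    use strict;
    use warnings;

    use constant ENABLE_UTF8_ASSERTS => 1;  # development-time switch

    sub assert_is_utf8 {
        my ( $str, $what ) = @_;
        if ( ENABLE_UTF8_ASSERTS && !utf8::is_utf8( $str ) ) {
            die "Assertion failed: $what is not flagged as UTF-8.\n";
        }
    }

    my $name = "\x{263A}";            # a char above U+00FF is always stored as UTF-8
    assert_is_utf8( $name, 'name' );  # passes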

If I capture those trailing slashes with a regular expression, the (UTF-8/native) flag is preserved. I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.
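
A sketch of that regex approach (the variable name is just for illustration):

    my $path = "/home/user/some-dir///";
    $path =~ s{ /+ \z }{}x;   # strip any trailing slashes
    # $path is now "/home/user/some-dir"
    # Caveat: a path consisting only of slashes becomes the empty string.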

Say substr sees that all sliced characters are ASCII and sets the "native string" flag. Say my code slices some other path components, some of which do have Unicode characters, so that those strings remain flagged as UTF-8. Let's assume that all those strings are concatenated together afterwards.

Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF-8, concatenation should be faster, shouldn't it?
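
For what it's worth, a small script to observe the mixture (using utf8::is_utf8 purely to peek at the internal flag):

    use strict;
    use warnings;
    use feature 'say';

    my $native = "abc";            # all ASCII, stored as native bytes
    my $wide   = "sm\x{263A}le";   # contains U+263A, stored as UTF-8 internally
    my $joined = $native . $wide;  # the characters are preserved either way

    say utf8::is_utf8( $native ) ? 'UTF-8' : 'native';  # native
    say utf8::is_utf8( $wide )   ? 'UTF-8' : 'native';  # UTF-8
    say utf8::is_utf8( $joined ) ? 'UTF-8' : 'native';  # UTF-8 (the native operand gets upgraded)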

In any case, is there a good reason why substr should take a 'UTF-8 string' and return a 'native string'? I have heard that other routines do respect the flag.

Re^3: substr on UTF-8 strings
by haukex (Archbishop) on Jun 24, 2020 at 12:46 UTC
    But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

    Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

    I verify that strings are flagged as native or as UTF-8 at the places where they should be.

    You should only be checking the UTF8 flag for debugging purposes when you find problems with your code, and not assuming what its state should be - it's an internal flag that can (and will) change across Perl versions.

    Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF8, concatenation should be faster, shouldn't it?

    I think you're worrying too much about the internals here. Perl generally does the right thing; you should only worry about it if you actually have problems with your code (write tests to check the input and output of your code), and you should only worry about speed if it becomes an issue for you.

    In general, for the best Unicode support, use the latest version of Perl (5.26 is pretty good, but think about upgrading using e.g. perlbrew), encode your source files as UTF-8, and start them off like this:

    #!/usr/bin/env perl
    use warnings;
    use 5.026;                     # or higher, enables "unicode_strings" features etc.
    use utf8;                      # source code is encoded with UTF-8
    use open qw/:std :utf8/;       # make UTF-8 default encoding, incl STDIN+OUT
    use warnings FATAL => 'utf8';  # optional

    And make sure to always specify the correct encoding when opening files ("open" Best Practices). If you have problems, feel free to post them here, see also my advice on that here.
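
    For example (the file name is hypothetical):

    open( my $fh, '<:encoding(UTF-8)', 'input.txt' )
        or die "Cannot open 'input.txt': $!";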

    I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.

    See the core module File::Spec for how to do operations on filenames in a portable way.

    Update: Added "use warnings FATAL => 'utf8';"

      Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

      Yes, that is the usual advice. But it is wrong in practice, in my opinion.

      First of all, who knows where my script, or parts of it, will land. Maybe on Windows. I once wrote a Perl script that I used heavily on Windows. Why should I assume that my console is using UTF-8?

      But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations. Say a file has an invalid UTF-8 sequence. What will Perl do? Die on read? Or just write a warning to stderr, so that the script never knows? Such warnings do not really help the end user. If you tell Perl not to check the file's UTF-8 for validity, will it really not check? Perl performs many string operations internally, and one of them may suddenly write a warning to stderr. What if you do want to check for UTF-8 encoding errors? What if your file mixes binary and UTF-8? Life is not that simple.

      In my current script, I am reading a UTF-8 text file. I am opening the file in "raw" mode and decoding every text line myself. This way, when a line has a UTF-8 encoding error, my script can cleanly tell the user which file and which line number the error is in. You cannot do that if you let Perl I/O handle things magically.
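
      Roughly like this, a sketch using the core Encode module (file and variable names illustrative):

      use Encode qw(decode FB_CROAK);

      my $filename = 'data.txt';  # hypothetical
      open( my $fh, '<:raw', $filename )
          or die "Cannot open '$filename': $!";

      while ( my $line = <$fh> ) {
          my $text = eval { decode( 'UTF-8', $line, FB_CROAK ) };
          if ( !defined $text ) {
              die "File '$filename', line $.: invalid UTF-8 sequence.\n";
          }
          # ... process $text ...
      }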

      See the core module File::Spec for how to do operations on filenames in a portable way.

      File::Spec is a disaster. Just try this:

      use File::Spec;

      foreach my $str ( File::Spec->splitdir( "/a///b/" ) ) {
          print "- \"$str\"\n";
      }

      This is what you get:

      - "" - "a" - "" - "" - "b" - ""

      It does not collapse multiple '/' separators as POSIX says you should. It adds empty directory entries before the first '/' and after the last '/'. That weird behavior is not documented (or did I miss it?), which does not really help.

      You have to do everything manually if you want the job to be done properly. It is actually a shame, because I really like Perl.

        Why should I assume that my console is using UTF-8?

        Sure, that's a valid point. But then again, if someone is going to be writing Unicode data to the console in the first place, what encoding should they use? If this is a tool in a UNIX command pipeline, UTF-8 seems fine to me, it can be piped to iconv, or one can add a command-line option to change the output encoding if desired. What the best way is depends on the situation; my point was just a response to your apparent complaint about the warning from Perl.
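
        For instance, a sketch of such an option (the option name is hypothetical):

        use Getopt::Long;

        my $output_encoding = 'UTF-8';  # default
        GetOptions( 'output-encoding=s' => \$output_encoding )
            or die "Error parsing command-line options.\n";
        binmode( STDOUT, ":encoding($output_encoding)" )
            or die "Cannot set STDOUT encoding: $!";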

        But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations.

        You can override the default by explicitly specifying an encoding in open.

        What if you do want to check for UTF-8 encoding errors?

        use warnings FATAL => 'utf8'; - people also add this to their boilerplate for that reason. (Update: I've added it to my boilerplate above.)

        What if your file mixes binary and UTF-8?

        In that case you'd have to fiddle with binmode or a manual decode anyway, no matter what the default encoding is.

        File::Spec is a disaster.

        It does have its faults, but in general, I disagree - in my experience it's much more common for people to put bugs in their code* via their reinvented wheel filename handling. But anyway, its canonpath removes a trailing slash. Of course, if you're certain your script will only run on POSIX systems, you can just use a regex to strip the trailing slash.

        * Update 2: Or at least make their code less portable. Also, note the module's canonpath does collapse "/a///b/" to "/a/b".
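
        For example:

        use File::Spec;

        print File::Spec->canonpath( "/a///b/" ), "\n";  # prints "/a/b"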

        First of all, who knows where my script, or parts of it, will land. Maybe on Windows

        See here for a portable version.

        That said, you'd want to switch your console to chcp 65001 and use UTF-8 if dealing with Unicode anyway.

        What if your file mixes binary and UTF-8?

        Binary files should be opened using :raw. This will override use open. Any portion that requires UTF-8 from decoded text can use Encode's encode or the builtin utf8::encode.
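
        A sketch of mixing the two in one handle (the file name and byte values are hypothetical):

        use Encode qw(encode);

        open( my $fh, '>:raw', 'mixed.bin' )
            or die "Cannot open 'mixed.bin': $!";
        print { $fh } "\x89BIN\x00";                     # raw binary bytes
        print { $fh } encode( 'UTF-8', "caf\x{E9}\n" );  # decoded text, encoded explicitly
        close( $fh ) or die "Cannot close 'mixed.bin': $!";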

Re^3: substr on UTF-8 strings
by Your Mother (Archbishop) on Jun 26, 2020 at 19:46 UTC
    I am finding Unicode support in Perl hard.

    You have a lot of good practical replies already. I just want to say: Unicode support in Perl (in versions contemporary to the comparisons being made) is simply the best there is; pretty unambiguously and objectively last time I checked. Some of it is extremely confusing because that’s just the bill of goods involved. Perl gives you ample tools to solve the problems.

Re^3: substr on UTF-8 strings
by ikegami (Patriarch) on Jun 26, 2020 at 19:33 UTC

    I am finding Unicode support in Perl hard. Most of my strings are ASCII, so there usually is no trouble. But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

    You are probably outputting decoded text (i.e. Unicode code points) to a file handle expecting encoded text (e.g. UTF-8). You can cause the encoding to happen automatically using

    use open ':std', ':encoding(UTF-8)';
