Re^3: substr on UTF-8 strings

by haukex (Archbishop)
on Jun 24, 2020 at 12:46 UTC ( [id://11118421]=note )


in reply to Re^2: substr on UTF-8 strings
in thread substr on UTF-8 strings

But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

I verify that strings are flagged as native or as UTF-8 at the places where they should be.

You should only be checking the UTF8 flag for debugging purposes when you find problems with your code, and not assuming what its state should be - it's an internal flag that can (and will) change across Perl versions.
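As a small sketch of why the flag is a poor thing to assert on: the same string can be stored with or without the flag and still behave identically. The variable names below are only illustrative.

```perl
use strict;
use warnings;

# One string, two internal representations.
my $native   = "caf\xE9";     # stored as single bytes, flag off
my $upgraded = $native;
utf8::upgrade($upgraded);     # same characters, stored as UTF-8 internally, flag on

my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Mixing both representations in one concatenation just works.
my $joined = $native . $upgraded;

printf "flags: %d and %d\n", $flag_native, $flag_upgraded;
printf "equal: %s, lengths: %d and %d, joined: %d\n",
    ( $native eq $upgraded ? 'yes' : 'no' ),
    length $native, length $upgraded, length $joined;
```

The two strings compare equal and have the same length, so code that depends on the flag's state is depending on an implementation detail.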

Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF8, concatenation should be faster, shouldn't it?

I think you're worrying too much about the internals here. Perl generally does the right thing; you should only worry about it if you actually have problems with your code (write tests to check the input and output of your code), and you should only worry about speed if it becomes an issue for you.

In general, for the best Unicode support, use the latest version of Perl (5.26 is pretty good, but think about upgrading using e.g. perlbrew), encode your source files as UTF-8, and start them off like this:

    #!/usr/bin/env perl
    use warnings;
    use 5.026; # or higher, enables "unicode_strings" features etc.
    use utf8; # source code is encoded with UTF-8
    use open qw/:std :utf8/; # make UTF-8 default encoding, incl STDIN+OUT
    use warnings FATAL => 'utf8'; # optional

And make sure to always specify the correct encoding when opening files ("open" Best Practices). If you have problems, feel free to post them here, see also my advice on that here.
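A minimal sketch of specifying the encoding explicitly on each open, assuming a temporary file (created here with File::Temp purely for illustration). The same file is read once through an encoding layer and once as raw bytes to show the difference.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically at exit.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Write characters; the layer encodes them to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} "na\x{EF}ve caf\x{E9}";
close $out or die "Cannot close $path: $!";

# Read characters back; the layer decodes the bytes for us.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $text = <$in>;
close $in;

# Read the same file as raw bytes, bypassing any default layers.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = <$raw>;
close $raw;

printf "%d characters, %d bytes\n", length $text, length $bytes;
```

An explicit layer in the three-argument open takes precedence over whatever default `use open` has set, so the two approaches can be combined.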

I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.

See the core module File::Spec for how to do operations on filenames in a portable way.

Update: Added "use warnings FATAL => 'utf8';"

Replies are listed 'Best First'.
Re^4: substr on UTF-8 strings
by ikegami (Patriarch) on Jun 26, 2020 at 19:35 UTC
Re^4: substr on UTF-8 strings
by rdiez (Acolyte) on Jun 24, 2020 at 13:14 UTC
    Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

    Yes, that is the usual advice. But it is wrong in practice, in my opinion.

    First of all, who knows where my script, or parts of it, will land. Maybe on Windows. I did write once a Perl script that I was using heavily on Windows. Why should I assume that my console is using UTF-8?

    But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations. Say a file has an invalid UTF-8 sequence. What will Perl do? Die on read? Or just write a warning on stderr, so that the script will never know? Such warnings do not really help the end user. If you tell Perl not to check UTF-8 for validity on the file, will it really not check? Perl is internally doing many string operations, one of them may suddenly write a warning to stderr. What if you do want to check for UTF-8 encoding errors? What if your file mixes binary and UTF-8? Life is not that simple.

    In my current script, I am reading a UTF-8 text file. I am opening the file in "raw" mode, and decoding every text line myself. This way, when a line has UTF-8 encoding errors, my script can cleanly tell the user what line number the error is in. You cannot do that if you let Perl I/O handle things magically.
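The approach described here could look roughly like the following sketch. It is not the poster's actual script: the sample data and the in-memory filehandle are assumptions made so the example is self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample input as raw bytes: two valid lines, then one with an invalid byte (0xFF).
my $data = "plain line\n" . "caf\xC3\xA9\n" . "bad \xFF byte\n";

# An in-memory handle stands in for a real file opened with '<:raw'.
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";

my @bad_lines;
while ( my $line = <$fh> ) {
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $line intact.
    my $text = eval { decode( 'UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
    if ( !defined $text ) {
        push @bad_lines, $.;    # $. is the current input line number
        next;
    }
    # ... work with the decoded $text here ...
}
close $fh;

print "Invalid UTF-8 on line $_\n" for @bad_lines;
```

Catching the failure per line is what lets the script report a precise location instead of relying on a warning printed to stderr.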

    See the core module File::Spec for how to do operations on filenames in a portable way.

    File::Spec is a disaster. Just try this:

        use File::Spec;
        foreach my $str ( File::Spec->splitdir( "/a///b/" ) ) {
            print "- \"$str\"\n";
        }

    This is what you get:

        - ""
        - "a"
        - ""
        - ""
        - "b"
        - ""

    It does not collapse multiple '/' separators like POSIX says you should do. It adds empty directories before the first / and after the last /. That weird behavior is not documented (or did I miss it?) That does not really help.

    You have to do everything manually if you want the job to be done properly. It is actually a shame, because I really like Perl.

      Why should I assume that my console is using UTF-8?

      Sure, that's a valid point. But then again, if someone is going to be writing Unicode data to the console in the first place, what encoding should they use? If this is a tool in a UNIX command pipeline, UTF-8 seems fine to me, it can be piped to iconv, or one can add a command-line option to change the output encoding if desired. What the best way is depends on the situation; my point was just a response to your apparent complaint about the warning from Perl.

      But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations.

      You can override the default by explicitly specifying an encoding in open.

      What if you do want to check for UTF-8 encoding errors?

      use warnings FATAL => 'utf8'; - people also add this to their boilerplate for that reason. (Update: I've added it to my boilerplate above.)

      What if your file mixes binary and UTF-8?

      In that case you'd have to fiddle with binmode or a manual decode anyway, no matter what the default encoding is.

      File::Spec is a disaster.

      It does have its faults, but in general, I disagree - in my experience it's much more common for people to put bugs in their code* via their reinvented wheel filename handling. But anyway, its canonpath removes a trailing slash. Of course, if you're certain your script will only run on POSIX systems, you can just use a regex to strip the trailing slash.

      * Update 2: Or at least make their code less portable. Also, note the module's canonpath does collapse "/a///b/" to "/a/b".
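A short sketch comparing the options mentioned in this exchange. File::Spec::Unix is called directly here only so the result is the same on every platform; in portable code you would normally call File::Spec itself. The regex variant is an illustration of the POSIX-only alternative, not something taken from either post.

```perl
use strict;
use warnings;
use File::Spec::Unix;

my $path = "/a///b/";

# canonpath collapses repeated slashes and drops the trailing one.
my $clean = File::Spec::Unix->canonpath($path);

# splitdir keeps empty components; filter them out if they are not wanted.
my @parts = grep { length } File::Spec::Unix->splitdir($path);

# POSIX-only alternative: strip trailing slashes with a regex, leaving a bare "/" alone.
my $stripped = $path =~ s{(?<=[^/])/+\z}{}r;

print "canonpath: $clean\n";
print "parts:     @parts\n";
print "regex:     $stripped\n";
```

Note that the regex only removes the trailing slash, while canonpath also normalizes the rest of the path.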

      First of all, who knows where my script, or parts of it, will land. Maybe on Windows

      See here for a portable version.

      That said, you'd want to switch your console to chcp 65001 and use UTF-8 if dealing with Unicode anyway.

      What if your file mixes binary and UTF-8?

      Binary files should be opened using :raw. This will override use open. Any portion that needs UTF-8 bytes produced from decoded text can use Encode's encode or the builtin utf8::encode.
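As a rough sketch of handling a mixed binary and UTF-8 record over :raw handles: the record layout below (a 16-bit length prefix, a UTF-8 payload, then a 32-bit marker) is entirely made up for illustration, and in-memory handles stand in for real files.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "caf\x{E9} \x{263A}";    # decoded text: 6 characters

# Encode only the textual portion, then pack it into a binary record.
my $payload = encode( 'UTF-8', $text );
my $record  = pack( 'n/a* N', $payload, 0xDEADBEEF );

# Write and read the record through :raw handles, so no layer touches the bytes.
my $buffer = '';
open my $out, '>:raw', \$buffer or die "Cannot open for writing: $!";
print {$out} $record;
close $out;

open my $in, '<:raw', \$buffer or die "Cannot open for reading: $!";
my $read_back = do { local $/; <$in> };
close $in;

# Unpack the binary structure, then decode just the text portion.
my ( $got_bytes, $marker ) = unpack( 'n/a* N', $read_back );
my $got_text = decode( 'UTF-8', $got_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );

printf "text has %d characters in %d bytes, marker 0x%X\n",
    length $got_text, length $got_bytes, $marker;
```

Only the text portion passes through encode and decode; the surrounding binary fields are never treated as characters.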
