Re^4: substr on UTF-8 strings

by rdiez (Acolyte)
on Jun 24, 2020 at 13:14 UTC


in reply to Re^3: substr on UTF-8 strings
in thread substr on UTF-8 strings

Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

Yes, that is the usual advice. But it is wrong in practice, in my opinion.

First of all, who knows where my script, or parts of it, will land? Maybe on Windows. I once wrote a Perl script that I used heavily on Windows. Why should I assume that my console is using UTF-8?

But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations. Say a file has an invalid UTF-8 sequence. What will Perl do? Die on read? Or just write a warning on stderr, so that the script will never know? Such warnings do not really help the end user. If you tell Perl not to check UTF-8 for validity on the file, will it really not check? Perl does many string operations internally, and any one of them may suddenly write a warning to stderr. What if you do want to check for UTF-8 encoding errors? What if your file mixes binary and UTF-8? Life is not that simple.

In my current script, I am reading a UTF-8 text file. I am opening the file in "raw" mode and decoding every text line myself. This way, when a line has UTF-8 encoding errors, my script can cleanly tell the user what line number the error is on. You cannot do that if you let Perl I/O handle things magically.
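To illustrate the approach described above, here is a minimal sketch rather than my actual script. The sub name find_utf8_errors and the in-memory test data are made up for the example. It reads raw bytes, decodes each line with Encode in strict mode, and records the line number of any line that fails:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw lines one at a time and return a list of messages for the
# lines that are not valid UTF-8. Expects a filehandle opened with :raw.
# (find_utf8_errors is a hypothetical helper name.)
sub find_utf8_errors {
    my ($fh) = @_;
    my @errors;
    while ( my $raw = <$fh> ) {
        # FB_CROAK makes decode() die on malformed input instead of
        # substituting U+FFFD; LEAVE_SRC keeps $raw unmodified.
        my $text = eval {
            decode( 'UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        };
        push @errors, "line $.: invalid UTF-8" unless defined $text;
    }
    return @errors;
}

# Demo on an in-memory "file": line 2 contains a lone 0xE9 byte,
# which is Latin-1 "e acute" and not valid UTF-8 on its own.
my $data = "ok\ncaf\xE9\nfine\n";
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
my @errors = find_utf8_errors($fh);
close $fh;

print "$_\n" for @errors;
```

Because the script does the decoding itself, it decides what happens on an error (report, skip, abort) instead of relying on a warning on stderr.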

See the core module File::Spec for how to do operations on filenames in a portable way.

File::Spec is a disaster. Just try this:

    use File::Spec;
    foreach my $str ( File::Spec->splitdir( "/a///b/" ) ) {
        print "- \"$str\"\n";
    }

This is what you get:

    - ""
    - "a"
    - ""
    - ""
    - "b"
    - ""

It does not collapse multiple '/' separators as POSIX says you should. It adds empty directories before the first / and after the last /. That weird behavior is not documented (or did I miss it?). That does not really help.

You have to do everything manually if you want the job to be done properly. It is actually a shame, because I really like Perl.

Replies are listed 'Best First'.
Re^5: substr on UTF-8 strings
by haukex (Archbishop) on Jun 24, 2020 at 13:36 UTC
    Why should I assume that my console is using UTF-8?

    Sure, that's a valid point. But then again, if someone is going to be writing Unicode data to the console in the first place, what encoding should they use? If this is a tool in a UNIX command pipeline, UTF-8 seems fine to me, it can be piped to iconv, or one can add a command-line option to change the output encoding if desired. What the best way is depends on the situation; my point was just a response to your apparent complaint about the warning from Perl.

    But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations.

    You can override the default by explicitly specifying an encoding in open.
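A small sketch of what that override looks like, assuming a scratch file created with File::Temp purely for the demonstration. With use open in effect, an open without layers picks up the default, while an open that names a layer uses the named one:

```perl
use strict;
use warnings;
use open qw/:std :utf8/;    # default layer for handles opened in this scope
use File::Temp qw(tempfile);

# Write the two-byte UTF-8 encoding of U+00E9 to a scratch file.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp, ':raw';
print {$tmp} "\xC3\xA9";
close $tmp or die "close failed: $!";

# No explicit layer: the default from "use open" applies,
# so the two bytes are read as one character.
open my $text_fh, '<', $path or die "Cannot open $path: $!";
my $decoded = <$text_fh>;
close $text_fh;

# Explicit layer: ":raw" takes precedence over the default,
# so the two original bytes come back untouched.
open my $bin_fh, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = <$bin_fh>;
close $bin_fh;

printf "default layer: length %d\n", length $decoded;
printf "explicit :raw: length %d\n", length $bytes;
```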

    What if you do want to check for UTF-8 encoding errors?

    use warnings FATAL => 'utf8'; - people also add this to their boilerplate for that reason. (Update: I've added it to my boilerplate above.)

    What if your file mixes binary and UTF-8?

    In that case you'd have to fiddle with binmode or a manual decode anyway, no matter what the default encoding is.

    File::Spec is a disaster.

    It does have its faults, but in general, I disagree - in my experience it's much more common for people to put bugs in their code* via their reinvented wheel filename handling. But anyway, its canonpath removes a trailing slash. Of course, if you're certain your script will only run on POSIX systems, you can just use a regex to strip the trailing slash.

    * Update 2: Or at least make their code less portable. Also, note the module's canonpath does collapse "/a///b/" to "/a/b".
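A quick way to check this, as a sketch: it calls File::Spec::Unix directly so that POSIX rules apply regardless of the host OS (plain File::Spec would pick the rules of whatever platform the script runs on).

```perl
use strict;
use warnings;
use File::Spec::Unix;    # force POSIX path rules regardless of the host OS

my $messy = '/a///b/';

# canonpath does a purely textual cleanup; it never touches the filesystem.
my $clean = File::Spec::Unix->canonpath($messy);

my @raw_parts   = File::Spec::Unix->splitdir($messy);
my @clean_parts = File::Spec::Unix->splitdir($clean);

print "canonpath: $clean\n";
print "splitdir on the original: ", scalar(@raw_parts), " parts\n";
print "splitdir after canonpath: ",
    join( ', ', map { qq("$_") } @clean_parts ), "\n";
```

After canonpath, splitdir still returns a leading empty string for an absolute path; that element is what marks the path as starting at the root.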

Re^5: substr on UTF-8 strings
by ikegami (Patriarch) on Jun 26, 2020 at 19:50 UTC

    First of all, who knows where my script, or parts of it, will land. Maybe on Windows

    See here for a portable version.

    That said, you'd want to switch your console to chcp 65001 and use UTF-8 if dealing with Unicode anyway.

    What if your file mixes binary and UTF-8?

    Binary files should be opened using :raw. This will override use open. Any portion that is text encoded as UTF-8 can then be decoded using Encode's decode or the builtin utf8::decode.
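A minimal sketch of handling such a mixed file. The record layout (a 2-byte big-endian length followed by that many bytes of UTF-8) is invented for the example, and the "file" is an in-memory string so the snippet is self-contained:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical record layout: 2-byte big-endian length, then UTF-8 bytes.
# The text is "naive" with i-diaeresis (U+00EF), a space, and U+2603.
my $payload = encode( 'UTF-8', "na\x{EF}ve \x{2603}" );
my $record  = pack( 'n', length $payload ) . $payload;

# :raw guarantees we see bytes, whatever "use open" default is in effect.
open my $fh, '<:raw', \$record or die "Cannot open in-memory file: $!";

# Binary part: read and unpack the length header.
read( $fh, my $header, 2 ) == 2 or die "Short read on length header";
my $len = unpack 'n', $header;

# Text part: read exactly $len bytes, then decode them explicitly.
read( $fh, my $bytes, $len ) == $len or die "Short read on text field";
close $fh;

my $text = decode( 'UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC() );

printf "%d bytes in the file, %d characters after decoding\n",
    $len, length $text;
```

Only the field known to be text is decoded; the length header stays binary and is never run through a UTF-8 layer.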
