Re^4: substr on UTF-8 strings

Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

Yes, that is the usual advice. But it is wrong in practice, in my opinion.

First of all, who knows where my script, or parts of it, will land. Maybe on Windows. I once wrote a Perl script that I used heavily on Windows. Why should I assume that my console is using UTF-8?

But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations. Say a file has an invalid UTF-8 sequence. What will Perl do? Die on read? Or just write a warning on stderr, so that the script will never know? Such warnings do not really help the end user. If you tell Perl not to check UTF-8 for validity on the file, will it really not check? Perl is internally doing many string operations, one of them may suddenly write a warning to stderr. What if you do want to check for UTF-8 encoding errors? What if your file mixes binary and UTF-8? Life is not that simple.

In my current script, I am reading a UTF-8 text file. I open the file in "raw" mode and decode every text line myself. This way, when a line has UTF-8 encoding errors, my script can cleanly tell the user which line number the error is on. You cannot do that if you let Perl I/O handle things magically.
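
A minimal sketch of that raw-read-then-decode approach might look like the following. The helper name `read_utf8_lines` and the error message format are illustrative assumptions, not the poster's actual code. It relies only on Encode's `decode` with the `FB_CROAK` check, which dies on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: read $path as raw bytes and decode each line strictly.
# Returns (\@lines, \@errors); each error names the offending line number.
sub read_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my ( @lines, @errors );
    while ( my $bytes = <$fh> ) {
        # FB_CROAK makes decode() die on invalid UTF-8;
        # LEAVE_SRC keeps $bytes unmodified.
        my $text = eval { decode( 'UTF-8', $bytes, FB_CROAK | LEAVE_SRC ) };
        if ( defined $text ) {
            push @lines, $text;
        }
        else {
            # $. holds the current line number of the last-read filehandle
            push @errors, "$path: invalid UTF-8 on line $.";
        }
    }
    close $fh;
    return ( \@lines, \@errors );
}
```

Because the script collects the errors itself, it can decide whether to abort, skip the line, or report all bad lines at once.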

See the core module File::Spec for how to do operations on filenames in a portable way.

File::Spec is a disaster. Just try this:

use File::Spec;

foreach my $str ( File::Spec->splitdir( "/a///b/" ) )
{
    print "- \"$str\"\n";
}

This is what you get:

- ""
- "a"
- ""
- ""
- "b"
- ""

It does not collapse multiple '/' separators as POSIX says you should. It adds empty directories before the first / and after the last /. That weird behavior is not documented (or did I miss it?). That does not really help.

You have to do everything manually if you want the job to be done properly. It is actually a shame, because I really like Perl.

Replies are listed 'Best First'.
Re^5: substr on UTF-8 strings by haukex (Archbishop) on Jun 24, 2020 at 13:36 UTC
Why should I assume that my console is using UTF-8?

Sure, that's a valid point. But then again, if someone is going to be writing Unicode data to the console in the first place, what encoding should they use? If this is a tool in a UNIX command pipeline, UTF-8 seems fine to me, it can be piped to iconv, or one can add a command-line option to change the output encoding if desired. What the best way is depends on the situation; my point was just a response to your apparent complaint about the warning from Perl.

But most importantly, if you automatically open all files in UTF-8, then you'll have serious limitations.

You can override the default by explicitly specifying an encoding in open.

What if you do want to check for UTF-8 encoding errors?

`use warnings FATAL => 'utf8';` - people also add this to their boilerplate for that reason. (Update: I've added it to my boilerplate above.)

What if your file mixes binary and UTF-8?

In that case you'd have to fiddle with binmode or a manual `decode` anyway, no matter what the default encoding is.

File::Spec is a disaster.

It does have its faults, but in general, I disagree - in my experience it's much more common for people to put bugs in their code* via their reinvented wheel filename handling. But anyway, its `canonpath` removes a trailing slash. Of course, if you're certain your script will only run on POSIX systems, you can just use a regex to strip the trailing slash.

* Update 2: Or at least make their code less portable. Also, note the module's `canonpath` does collapse `"/a///b/"` to `"/a/b"`.
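
The `canonpath` point can be sketched as follows. The helper name `path_parts` is hypothetical, and `File::Spec::Unix` is called explicitly only so that the result does not depend on the platform running the example; portable code would call `File::Spec->canonpath` and `File::Spec->splitdir` instead.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Hypothetical helper: canonicalize first, then split, then drop the
# empty leading element that splitdir returns for the root directory.
sub path_parts {
    my ($path) = @_;
    my $clean = File::Spec::Unix->canonpath($path);    # "/a///b/" -> "/a/b"
    return grep { length } File::Spec::Unix->splitdir($clean);
}

print qq{- "$_"\n} for path_parts("/a///b/");
# prints:
# - "a"
# - "b"
```

Note that dropping the empty leading element also discards the information that the path was absolute, so this is only appropriate when that distinction does not matter.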
Re^5: substr on UTF-8 strings by ikegami (Patriarch) on Jun 26, 2020 at 19:50 UTC
First of all, who knows where my script, or parts of it, will land. Maybe on Windows

See here for a portable version. That said, you'd want to switch your console to chcp 65001 and use UTF-8 if dealing with Unicode anyway.

What if your file mixes binary and UTF-8?

Binary files should be opened using `:raw`. This will override `use open`. Any portion that is text encoded as UTF-8 can then be decoded using Encode's `decode` or the builtin `utf8::decode`.
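
For a file that mixes binary data and UTF-8 text, one way to apply this is to read bytes from a `:raw` handle and decode only the text portions. The sketch below assumes a made-up record layout (a 4-byte big-endian length followed by that many bytes of UTF-8 text); the layout and the helper name `read_record` are illustrative assumptions, not anything from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical record layout (assumption for illustration only):
# a 4-byte big-endian length, then that many bytes of UTF-8 text.
# Expects $fh to have been opened with the :raw layer.
sub read_record {
    my ($fh) = @_;

    # Binary part: read the length prefix as raw bytes.
    my $got = read( $fh, my $header, 4 );
    return unless defined $got && $got == 4;    # EOF or short read
    my $len = unpack 'N', $header;

    # Text part: read the payload bytes, then decode them strictly.
    $got = read( $fh, my $bytes, $len );
    die "Truncated record" unless defined $got && $got == $len;
    return decode( 'UTF-8', $bytes, FB_CROAK );
}
```

A caller would open the file with `open my $fh, '<:raw', $path` and call `read_record($fh)` until it returns nothing.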

