
But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

Add use open qw/:std :utf8/; at the top of your code so that STDIN, STDOUT and STDERR are treated as UTF-8 (assuming your console is UTF-8).

I verify that strings are flagged as native or as UTF-8 at the places where they should be.

You should only check the UTF8 flag for debugging purposes when you find problems with your code, and you should not assume what its state should be: it's an internal flag that can (and will) change across Perl versions.

Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF8, concatenation should be faster, shouldn't it?

I think you're worrying too much about the internals here. Perl generally does the right thing; you should only worry about it if you actually have problems with your code (write tests to check the input and output of your code), and you should only worry about speed if it becomes an issue for you.
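To make the point about mixed strings concrete, here is a minimal sketch (the variable names and sample data are made up for illustration). It concatenates a string that fits in single bytes with one that contains a character above 255, and shows that the result behaves the same at the character level. The utf8::is_utf8 call at the end is there only as a debugging aid; what it reports for the first string is an internal detail you shouldn't rely on.

```perl
use strict;
use warnings;

# Hypothetical sample data, written with escapes so this file stays pure ASCII.
my $native = "caf\xE9";      # 4 characters, all below 256
my $wide   = "\x{263A}";     # 1 character above 255, which forces UTF-8 storage

my $joined = $native . $wide;

# Character semantics are the same whatever the internal storage is.
print length($joined), "\n";                     # prints 5
printf "U+%04X\n", ord(substr($joined, 3, 1));   # prints U+00E9

# Upgrading a copy changes only the internal storage, not the string's value.
my $upgraded = $native;
utf8::upgrade($upgraded);
print $upgraded eq $native ? "equal\n" : "different\n";   # prints equal

# Debugging only: peek at the internal flag without relying on its state.
printf "native flagged: %s, joined flagged: %s\n",
    (utf8::is_utf8($native) ? "yes" : "no"),
    (utf8::is_utf8($joined) ? "yes" : "no");
```

As long as the code only looks at characters (length, substr, ord, eq), it doesn't need to know which representation Perl picked.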

In general, for the best Unicode support, use a recent version of Perl (5.26 is pretty good; if yours is older, think about upgrading using e.g. perlbrew), encode your source files as UTF-8, and start them off like this:

#!/usr/bin/env perl
use warnings;
use 5.026; # or higher, enables "unicode_strings" features etc.
use utf8; # source code is encoded with UTF-8
use open qw/:std :utf8/; # make UTF-8 default encoding, incl STDIN+OUT
use warnings FATAL => 'utf8'; # optional

And make sure to always specify the correct encoding when opening files (see "open" Best Practices). If you have problems, feel free to post them here; see also my advice on that here.
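As a rough sketch of what specifying the encoding on open looks like (the temporary file and the sample text are just placeholders for this example), the following writes a string through an :encoding(UTF-8) layer and reads it back through the same layer:

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Placeholder file for the demonstration; it is removed when the script exits.
my ($tmpfh, $filename) = tempfile(UNLINK => 1);
close $tmpfh or die "Cannot close $filename: $!";

# Sample text with non-ASCII characters, written with escapes to keep this file ASCII.
my $text = "Gr\x{FC}\x{DF}e, \x{4E16}\x{754C}\n";

# Encode characters to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $filename
    or die "Cannot open $filename for writing: $!";
print $out $text;
close $out or die "Cannot close $filename: $!";

# Decode UTF-8 bytes back to characters on the way in.
open my $in, '<:encoding(UTF-8)', $filename
    or die "Cannot open $filename for reading: $!";
my $read_back = <$in>;
close $in or die "Cannot close $filename: $!";

print $read_back eq $text ? "round trip ok\n" : "mismatch\n";
```

Naming the layer explicitly on each open documents the file's encoding right where it matters, even if a use open pragma already sets a default.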

I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.

See the core module File::Spec for how to do operations on filenames in a portable way.
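For the trailing slash case, something along these lines might be enough. It is only a sketch with a made-up path; the exact output depends on your platform, since File::Spec picks the rules for the OS it runs on. Note that canonpath does a purely logical cleanup of the string and does not look at the filesystem.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical directory name with a doubled and a trailing separator.
my $dir = '/var/log//myapp/';

# canonpath tidies the path: it collapses repeated separators and drops
# the trailing one, without checking whether the path exists.
my $clean = File::Spec->canonpath($dir);
print "$clean\n";

# catfile joins directory and file name with the platform's separator.
my $file = File::Spec->catfile($clean, 'errors.log');
print "$file\n";
```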

Update: Added "use warnings FATAL => 'utf8';"

In reply to Re^3: substr on UTF-8 strings by haukex
in thread substr on UTF-8 strings by rdiez
