Curious about why some characters cause issues with mkdir/print

YenForYang has asked for the wisdom of the Perl Monks concerning the following question:

Note: I have a solution/workaround to my problem, BUT I was wondering if there was an explanation for this issue.

Background:

So I have use utf8 (or more precisely BEGIN{$^H |= 8388608}) in this Perl script of mine, which extracts a string from an HTML file (decoding entities in the process), replaces/transliterates certain strings/characters, and prints it back out. There was a time when I didn't have the pragma added, and it caused issues, i.e. I got weird characters like U+FFFD � in the output. I'm guessing (correct me if I'm wrong) that this happens because I have non-ASCII characters in my code, specifically in tr/.../..non-ASCII chars here../s. I use literal characters rather than escape sequences like \x{...} so that I can easily distinguish them, since one of the purposes of the script is to transliterate printable ASCII characters that are forbidden in filenames (Windows/Linux) into a different Unicode character that is allowed (e.g. ? to ‽, \ to ＼).

So I think I understand what use utf8 does.

Problem:

But I stumbled across another issue recently involving strings that contained some non-ASCII characters. (Note that I only transliterate a handful of characters, so the vast majority of characters are not replaced for the strings I extract). After parsing this arbitrary string and calling CORE::mkdir on it or CORE::print on it, some non-ASCII characters are messed up and replaced with some other character.

An example of one of the characters that caused issues: ☆.

The HTML page originally contained &#9734; (the HTML decimal entity equivalent), which was then converted to ☆ by my HTML parser.

print outputs the character ā instead of ☆.

What's interesting is that if I remove the use utf8 (or BEGIN{$^H |= 8388608}) from the script, the problematic character is printed just fine, BUT basically every other non-letter, non-number (ASCII?) character like ! Space & etc. is replaced with the aforementioned � character.

What's also interesting is that if I utf8::downgrade or utf8::decode the string before printing, everything prints fine.

So basically I'm asking if anyone has an explanation for this behavior. Thanks.

Re: Curious about why some characters cause issues with mkdir/print by ikegami (Patriarch) on Mar 19, 2018 at 17:50 UTC
Perl operators that deal with paths suffer from The Unicode Bug. The path actually used is provided by the following sub:

```perl
sub path_actually_used {
    if (utf8::is_utf8($_[0])) {
        # Upgraded string: the path is the UTF-8 encoding of its characters.
        my $s = $_[0];
        utf8::encode($s);
        return $s;
    } else {
        # Downgraded string: the bytes are used as-is.
        return $_[0];
    }
}
```

That means that if you have encoded bytes in an upgraded string, Perl will get it wrong.

```perl
my $s = chr(9734);
mkdir($s);           # ok

utf8::encode($s);
mkdir($s);           # ok

utf8::upgrade($s);
mkdir($s);           # not ok
```

It's virtually impossible to get into that situation without a bug in your code because the encoding functions always return a downgraded string.

You've already identified the solution: If the path is a string of encoded text (i.e. UTF-8), passing it through `utf8::downgrade($s)` will ensure it's used correctly. If the path is a string of decoded text (i.e. Unicode Code Points), encoding it (e.g. using `utf8::encode($s)`) will ensure it's used correctly.
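As a rough illustration of the two cases described above, here is a minimal sketch that stops short of touching the filesystem. The helper name `path_bytes` and the sample name `☆ stars` are made up for this example and are not from the original post. It assumes the text has already been decoded (for instance by an HTML entity decoder) and uses the core `Encode` module to produce the bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the thread): decoded text in, UTF-8 bytes out.
sub path_bytes {
    my ($text) = @_;
    return encode('UTF-8', $text);    # Encode returns a downgraded byte string
}

my $name  = chr(9734) . ' stars';     # decoded text: U+2606 plus six ASCII characters
my $bytes = path_bytes($name);        # what mkdir/open should be given

# Simulate the bug: the same encoded bytes, but stored in an upgraded string.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# What the path sub in the reply would hand to the OS for the upgraded copy.
my $double = $upgraded;
utf8::encode($double);

# The fix for already-encoded text: downgrade before use.
my $fixed = $upgraded;
utf8::downgrade($fixed);

printf "decoded:         %d characters\n", length($name);     # 7
printf "encoded:         %d bytes\n",      length($bytes);    # 9
printf "double-encoded:  %d bytes\n",      length($double);   # 12
printf "upgraded flag:   %s\n", utf8::is_utf8($upgraded) ? 'on' : 'off';   # on
printf "downgraded flag: %s\n", utf8::is_utf8($fixed)    ? 'on' : 'off';   # off

# mkdir($bytes) or die "mkdir failed: $!";   # uncomment to actually create the directory
```

The jump from 9 to 12 bytes is the double encoding: each of the three bytes of ☆ is itself re-encoded as two bytes, which is why a different name ends up on disk.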
Re: Curious about why some characters cause issues with mkdir/print by Anonymous Monk on Mar 19, 2018 at 18:07 UTC
`use utf8` does not(!) influence Perl's handling of Unicode data: it only tells Perl's compiler that the source code itself is encoded in UTF-8. If your source code contains non-ASCII UTF-8 characters, then you must use this pragma for the source to be parsed correctly; without it, string literals may be misparsed as individual bytes. But this will not affect what you see on your screen or web page: the terminal or browser must be separately informed that the data is UTF-8 if that is not its default. To remove all doubt, divert this output to a file and then use a hexadecimal viewer to look at the actual byte sequences.
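To make the point about source parsing concrete, here is a small sketch that compares the same literal with and without the pragma and dumps the bytes in hex instead of relying on the terminal. It assumes the script file itself is saved as UTF-8. `hex_dump` is a hypothetical helper written for this example, standing in for the hexadecimal viewer.

```perl
use strict;
use warnings;

my ($with, $without);
{ use utf8; $with    = "☆"; }   # literal parsed as one character, U+2606
{ no  utf8; $without = "☆"; }   # same literal parsed as three separate bytes

# Hypothetical helper: render each character of a byte string as two hex digits.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

# Encode the decoded character explicitly so we know which bytes would be output.
my $encoded = $with;
utf8::encode($encoded);

printf "with use utf8:    length %d\n", length($with);      # 1
printf "without use utf8: length %d\n", length($without);   # 3
printf "encoded bytes:    %s\n", hex_dump($encoded);        # E2 98 86
printf "raw source bytes: %s\n", hex_dump($without);        # E2 98 86
```

The pragma only changes how the literal is read: one character versus three bytes. What reaches the screen still depends on encoding the string on output and on the terminal expecting UTF-8.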

