PerlMonks  

Curious about why some characters cause issues with mkdir/print

by YenForYang (Beadle)
on Mar 19, 2018 at 17:25 UTC ( [id://1211251]=perlquestion )

YenForYang has asked for the wisdom of the Perl Monks concerning the following question:

Note: I have a solution/workaround to my problem, BUT I was wondering if there was an explanation for this issue.

Background:

My script has use utf8 (or, more precisely, BEGIN{$^H |= 8388608}). It extracts a string from an HTML file (decoding entities in the process), replaces/transliterates certain strings and characters, and prints the result. Before I added the pragma, I got weird characters such as U+FFFD � in the output. I'm guessing (correct me if I'm wrong) that this happened because my code contains literal non-ASCII characters, specifically in tr/.../..non-ASCII chars here../s, rather than escape sequences like \x{...}, so that I can easily tell the characters apart. One purpose of the script is to transliterate printable ASCII characters that are forbidden in filenames (Windows/Linux) into different Unicode characters that are allowed (e.g. ? to ‽, \ to \).

So I think I understand what use utf8 does.
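
As an illustration only, the kind of transliteration described above might look roughly like the sketch below. The sanitize_name helper is hypothetical, and only the ? to ‽ pair is taken from the post; the rest of the real table is not shown there.

```perl
use strict;
use warnings;
use utf8;    # the tr/// literal below is stored as UTF-8 in the source file

# Hypothetical helper: replace characters that are forbidden in filenames
# with allowed look-alikes. Only ? -> ‽ (U+203D) comes from the post.
sub sanitize_name {
    my ($name) = @_;
    (my $clean = $name) =~ tr/?/‽/;
    return $clean;
}

my $clean = sanitize_name("What now?");

# Encode on output so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
```

Without use utf8, the ‽ in the tr/// would be read as three separate bytes rather than one character, which is one way to end up with stray replacement characters in the output.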

Problem:

But I stumbled across another issue recently involving strings that contained some non-ASCII characters. (Note that I only transliterate a handful of characters, so the vast majority of characters are not replaced for the strings I extract). After parsing this arbitrary string and calling CORE::mkdir on it or CORE::print on it, some non-ASCII characters are messed up and replaced with some other character.

An example of one of the characters that caused issues: ☆.

The HTML page originally contained &#9734; (the HTML decimal entity equivalent), which was then converted to ☆ by my HTML parser.

print outputs the character â instead of ☆.

What's interesting is that if I remove the use utf8 ( or BEGIN{$^H |=8388608} ) from the script, the problematic character is printed just fine, BUT basically every other non-letter non-number (ASCII?) character like ! Space & etc. is replaced with the aforementioned � character.

What's also interesting is that if I utf8::downgrade or utf8::decode the string before printing, everything prints fine.

So basically I'm asking if anyone has an explanation for this behavior. Thanks.

Replies are listed 'Best First'.
Re: Curious about why some characters cause issues with mkdir/print
by ikegami (Patriarch) on Mar 19, 2018 at 17:50 UTC

    Perl operators that deal with paths suffer from The Unicode Bug. The path actually used is provided by the following sub:

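    sub path_actually_used {
        if (utf8::is_utf8($_[0])) {
            # Upgraded string: the UTF-8 encoding of its characters is used.
            my $s = $_[0];
            utf8::encode($s);
            return $s;
        }
        else {
            # Downgraded string: its bytes are used as-is.
            return $_[0];
        }
    }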

    That means that if you have encoded bytes in an upgraded string, Perl will get it wrong.

    my $s = chr(9734);
    mkdir($s);            # ok
    utf8::encode($s);
    mkdir($s);            # ok
    utf8::upgrade($s);
    mkdir($s);            # not ok

    It's virtually impossible to get into that situation without a bug in your code because the encoding functions always return a downgraded string.
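
To see why the third call goes wrong, it may help to look at the bytes involved. The sketch below restates the sub above and adds a hypothetical hex_bytes helper (not part of the original reply) to dump what would be handed to the OS in each of the three states.

```perl
use strict;
use warnings;

# Same logic as the sub above: the bytes Perl would use for a path string.
sub path_actually_used {
    my $s = $_[0];
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0];
}

my $s = chr(9734);                  # decoded text: one character, U+2606
my $decoded = hex_bytes(path_actually_used($s));

utf8::encode($s);                   # encoded text: three bytes, downgraded
my $encoded = hex_bytes(path_actually_used($s));

utf8::upgrade($s);                  # same three "characters", upgraded storage
my $upgraded = hex_bytes(path_actually_used($s));

print "$decoded\n";                 # E2 98 86
print "$encoded\n";                 # E2 98 86
print "$upgraded\n";                # C3 A2 C2 98 C2 86
```

In the third case each of the original bytes is treated as a character and encoded again, so the name starts with C3 A2, which is the UTF-8 encoding of â.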

    You've already identified the solution:

    • If the path is a string of encoded text (i.e. UTF-8), passing it through utf8::downgrade($s) will ensure it's used correctly.
    • If the path is a string of decoded text (i.e. Unicode Code Points), encoding it (e.g. using utf8::encode($s)) will ensure it's used correctly.
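
The two cases above could be wrapped up roughly as follows. This is only a sketch: path_bytes and its $is_decoded flag are hypothetical names, and it assumes a filesystem that stores names as UTF-8 bytes (typical on Linux). It creates its directory under a temporary location so it is safe to run.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: return the bytes to pass to mkdir. $is_decoded says
# whether $path holds Unicode code points (true) or UTF-8 bytes (false).
sub path_bytes {
    my ($path, $is_decoded) = @_;
    if ($is_decoded) {
        utf8::encode($path);        # characters -> UTF-8 bytes
    }
    else {
        utf8::downgrade($path);     # same bytes, byte storage; dies on chars > 255
    }
    return $path;
}

my $decoded = chr(9734);            # one character, U+2606
my $encoded = "\xE2\x98\x86";       # the same star as UTF-8 bytes
utf8::upgrade($encoded);            # simulate the buggy upgraded state

my $from_decoded = path_bytes($decoded, 1);
my $from_encoded = path_bytes($encoded, 0);

# Both routes give the same three bytes, so either is safe to hand to mkdir.
my $base = tempdir(CLEANUP => 1);
my $dir  = File::Spec->catdir($base, $from_decoded);
mkdir $dir or die "mkdir failed: $!";

print "same bytes\n" if $from_decoded eq $from_encoded;
print "created\n"    if -d $dir;
```

The caller has to know which kind of string it holds; Perl cannot tell decoded text from encoded bytes by inspecting the string.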
Re: Curious about why some characters cause issues with mkdir/print
by Anonymous Monk on Mar 19, 2018 at 18:07 UTC
    use utf8 does not(!) influence Perl's handling of Unicode data: it only tells Perl's compiler that the source code itself is encoded as UTF-8. If your source code contains non-ASCII characters encoded as UTF-8, you must use this pragma so that the source is parsed correctly; without it, string literals may be misread as individual bytes. But this does not affect what you see on your screen or web page: the output handle, terminal, or browser must be told separately that the data is UTF-8 if that is not its default. To remove all doubt, redirect the output to a file and look at the actual byte sequences with a hexadecimal viewer.
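
A small sketch of both points, assuming the script file itself is saved as UTF-8: the pragma changes how the literal is parsed, while the bytes that reach the output depend on the layer set on the handle. An in-memory filehandle stands in for the file-plus-hex-viewer step, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;    # the literals below are stored as UTF-8 in the source file

# With the pragma the literal is one character; without it, three bytes.
my $chars_with    = length("☆");
my $bytes_without = do { no utf8; length("☆") };
print "$chars_with $bytes_without\n";    # 1 3

# Output encoding is configured separately, per handle. Writing to an
# in-memory handle lets us inspect the exact bytes that were produced.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} "☆";
close $fh;

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
print "$hex\n";                          # E2 98 86
```

For STDOUT, the equivalent of the :encoding(UTF-8) layer is binmode(STDOUT, ':encoding(UTF-8)').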
