printing Unicode works for some characters but not all

by fireblood (Scribe)
on Jun 04, 2017 at 03:45 UTC

fireblood has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks,

I have found that I can print some Unicode characters to stdout but not all. I'm not sure why there is a difference.

I am using perl 5.22 under Cygwin on a 64-bit Windows 10 PC.

My code excerpt is as follows:

binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
print "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}", "\n";
my $smiley_from_name = "\N{WHITE SMILING FACE}";
my $smiley_from_code_point = "\N{U+263a}";
print $smiley_from_name, "\n";
print $smiley_from_code_point, "\n";
my $dui = "\N{U+5C0D}";
print $dui, "\n";
die "\n", "\N{U+5C0D}", "\N{U+4E0D}", "\N{U+8D77}", ", the file $target_file does not exist.\n\n" unless -e $target_file;

What I get is the following:


???, the file yyy does not exist.

What I expected to get was the following:


對不起, the file yyy does not exist.

So I am unable to figure out why some code points work just fine while others display only as ?.

Any suggestions?

Replies are listed 'Best First'.
Re: printing Unicode works for some characters but not all
by kcott (Archbishop) on Jun 04, 2017 at 06:45 UTC

    G'day fireblood,

    For generally troubleshooting this type of problem, you need to assess the Unicode abilities of all elements involved.

    Firstly, check that the code point is a valid Unicode code point with a printable character assigned to it. Note that, although the code point may be in a valid block, i.e. a range of code points, it may not be a printable character: it may be unassigned, reserved, a control character, or similar. See the "Unicode Code Charts".
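
As a rough first pass before turning to the code charts, Perl's own Unicode tables can answer the "assigned and printable?" question. This is only a sketch: `describe_code_point` is a hypothetical helper name, and the answers reflect whichever Unicode version your perl ships with, so a character newer than that version will show up as unassigned.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: describe one code point using the Unicode tables
# bundled with the running perl.
sub describe_code_point {
    my ($cp) = @_;
    my $char = chr $cp;
    return {
        code_point => sprintf('U+%04X', $cp),
        name       => charnames::viacode($cp) // '(no name)',
        assigned   => ($char =~ /\p{Assigned}/ ? 1 : 0),
        printable  => ($char =~ /\p{Print}/    ? 1 : 0),
    };
}

binmode STDOUT, ':encoding(UTF-8)';

# The three characters from the question, plus a control character (BEL)
# as an example of "assigned but not printable".
for my $cp (0x5C0D, 0x4E0D, 0x8D77, 0x0007) {
    my $info = describe_code_point($cp);
    printf "%s  assigned=%d  printable=%d  %s\n",
        @{$info}{qw(code_point assigned printable name)};
}
```

If all three ideographs report `assigned=1 printable=1`, the code points themselves are fine and the problem lies further down the chain.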

    Next check Perl's capabilities. If you look in the Miscellaneous section of perldoc you'll find the perldelta pages. These will tell you which version of Unicode is supported by which version of Perl. They only tell you when a new Unicode version is supported, so that can take some hunting around: check the zero subversions (5.22.0, 5.24.0, etc.) first. For your version up to the latest:

    Perl version    Unicode version supported
    5.22            7.0
    5.24            8.0
    5.26            9.0

    The Unicode::UCD module (UCD = "Unicode Character Database") can provide you with a lot of other useful information. Here are just a few examples:

    Which Unicode version does your current Perl support? I'm using Perl 5.26, so it shows Unicode 9; you're using 5.22, so it should show Unicode 7.

    $ perl -E 'use Unicode::UCD; say Unicode::UCD::UnicodeVersion'
    9.0.0

    In which version of Unicode did a character first appear (given by the "Age" property)? Here are a couple: one from your post; one I happened to know was a recent addition.

    $ perl -E 'use Unicode::UCD "charprop"; say charprop("U+5C0D", "Age")'
    V1_1
    $ perl -E 'use Unicode::UCD "charprop"; say charprop("U+1F9C0", "Age")'
    V8_0

    If I switch to Perl 5.22, the output from that last command becomes:

    $ perl -E 'use Unicode::UCD "charprop"; say charprop("U+1F9C0", "Age")'
    Unassigned

    Note that, in isolation, that output is indistinguishable from a code point which isn't actually assigned; however, if you did the "valid Unicode code point" check first, as suggested, you'll know the difference.

    $ perl -E 'use Unicode::UCD "charprop"; say charprop("U+1E95A", "Age")'
    Unassigned

    [See Unicode code charts (PDF): "Supplemental Symbols and Pictographs" for U+1F9C0 (a recently added emoji which looks like a wedge of cheese); "Adlam" for U+1E95A (no special significance: Adlam was alphabetically first when searching for a block with an unassigned code point; U+1E95A just happened to be in a noticeable gap between assigned code points).]
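
The one-liners above can be folded into a small script that reports, for each code point, the Unicode version it first appeared in alongside the version the running perl knows about. Treat this as a sketch: `age_to_version` and `first_unicode_version` are hypothetical helper names, and it assumes the "Age" value has the `V8_0` shape shown above.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charprop);

# Hypothetical helper: turn an Age value such as "V8_0" into "8.0".
# Returns undef for anything else, e.g. "Unassigned".
sub age_to_version {
    my ($age) = @_;
    return $age =~ /\AV(\d+)_(\d+)\z/ ? "$1.$2" : undef;
}

# Hypothetical helper: Unicode version in which a code point first appeared,
# according to the tables bundled with this perl.
sub first_unicode_version {
    my ($code_point) = @_;    # e.g. "U+5C0D"
    return age_to_version(charprop($code_point, 'Age'));
}

printf "This perl uses Unicode %s\n", Unicode::UCD::UnicodeVersion();

for my $code_point ('U+5C0D', 'U+1F9C0', 'U+1E95A') {
    my $version = first_unicode_version($code_point);
    printf "%s: %s\n", $code_point,
        defined $version
            ? "added in Unicode $version"
            : 'unassigned as far as this perl knows';
}
```

Anything reported as unassigned still needs the code chart check to tell "too new for this perl" apart from "never assigned".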

    Next, you'll need to check the Unicode support available for your operating system, the application you're using to display the characters, fonts being used and so on. I don't have those available; however, this would (as far as I know) be valid from a Cygwin command line, and may provide some insight:

    $ perl -C -E 'say "\x{5c0d}"'
    $ echo "對"

    Note that I used <pre> tags for that last part. When showing characters outside the ASCII range, these are a better choice than <code> tags which will often just render them as entity references (e.g. &#x5C0D;).
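
To separate what Perl emits from what the terminal and font can draw, one option is to look at the encoded bytes rather than the rendered glyphs. A minimal sketch using the core Encode module; `utf8_hex` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 bytes for a string of characters.
sub utf8_hex {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', $_ }
        unpack 'C*', encode('UTF-8', $string);
}

# The three characters from the question.
my $apology = "\N{U+5C0D}\N{U+4E0D}\N{U+8D77}";
print utf8_hex($apology), "\n";    # prints E5 B0 8D E4 B8 8D E8 B5 B7
```

If those nine bytes are what reaches the terminal, Perl has done its part, and a `?` on screen points at the terminal, its character set setting, or the font.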

    — Ken

      Was just investigating Unicode today, which was suggested by reading up on the new stuff in 5.26 .. and this post helped explain a number of questions. Great, great answer. Thank you.

      Alex / talexb / Toronto

      Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Hi Ken,

      Wow, your answer is so complete and detailed, you put a lot of effort into it. I appreciate your answer very much, it gives me a much deeper understanding of all of the factors that are involved in determining whether or not any given Unicode character can be displayed.

      I will apply your wisdom to my current project, and will upgrade to 5.26 as well. I didn't know that 5.26 was available already.

      Thanks again,
Re: printing Unicode works for some characters but not all
by Anonymous Monk on Jun 04, 2017 at 04:10 UTC
    The :utf8 layer is not the same as the :encoding(UTF-8) layer. In terms of binmode, what your display shows is not important as long as the correct bytes are output; not every Unicode character can be displayed.
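
One way to see what each layer actually writes is to print through it into an in-memory file and inspect the bytes. This is a sketch only, and `bytes_via_layer` is a hypothetical helper name. For well-formed text such as the characters in the question, both layers should produce identical bytes; as far as I know the difference is that :encoding(UTF-8) validates what passes through it, while :utf8 does not.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through the given PerlIO layer into an
# in-memory file and return the resulting bytes as an uppercase hex string.
sub bytes_via_layer {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return uc(unpack('H*', $buffer));
}

my $dui = "\N{U+5C0D}";
print bytes_via_layer(':utf8', $dui), "\n";               # prints E5B08D
print bytes_via_layer(':encoding(UTF-8)', $dui), "\n";    # prints E5B08D
```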