http://qs321.pair.com?node_id=814129

jbert has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

It seems to me that mod_perl2 isn't doing the right thing wrt utf8.

This seems unlikely to me (it's widely used software), so I'd appreciate a second opinion.

Here are two ways of constructing a string containing the unicode character U+00B4 (ACUTE ACCENT):

use feature 'say';
use Encode;

binmode STDOUT, ':utf8';

my $strA = "Bob\x{B4}s files";
my $strB = Encode::decode_utf8("Bob\xC2\xB4s files");

say "strA is $strA";
say "strB is $strB";
say "strings are " . (($strA eq $strB) ? "eq" : "not eq");
say "strA flag is: " . Encode::is_utf8($strA);
say "strB flag is: " . Encode::is_utf8($strB);

Because U+00B4 is within latin1, perl is willing and able to store strA as a byte-string (no utf8 flag set). This matches what perldoc perlunicode says about \x: "for characters under 0x100, note that Perl may use an 8 bit encoding internally, for optimization and/or backward compatibility."
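To make the difference concrete, here is a minimal sketch that inspects both strings. It only uses core functions (utf8::is_utf8 and the bytes pragma); the variable names for the results are just for illustration. The bytes pragma is used here purely to peek at the internal buffer size, not as something to rely on in real code.

```perl
use strict;
use warnings;
use Encode ();

my $strA = "Bob\x{B4}s files";                         # built with a \x escape
my $strB = Encode::decode_utf8("Bob\xC2\xB4s files");  # decoded from UTF-8 octets

# Same sequence of characters, so the strings compare equal.
my $equal = ($strA eq $strB) ? 1 : 0;

# The internal utf8 flag differs between the two.
my $flagA = utf8::is_utf8($strA) ? 1 : 0;
my $flagB = utf8::is_utf8($strB) ? 1 : 0;

# The size of the internal buffer differs as well:
# U+00B4 is stored as one byte in strA and as two bytes in strB.
my ($internalA, $internalB);
{
    use bytes;
    $internalA = length $strA;
    $internalB = length $strB;
}

print "equal=$equal flagA=$flagA flagB=$flagB ",
      "internalA=$internalA internalB=$internalB\n";
```

Anything that writes out the internal buffer without consulting the flag would therefore emit different bytes for two strings that perl itself considers equal.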

If I have a mod_perl2 handler which just sends text/plain utf8 content and sends those two strings via $r->print then I see different results in the browser (strA doesn't render correctly). Note that $r->binmode seems to do nothing.

To me, this is a mod_perl bug. In perl, the strings are equivalent. It seems mod_perl2 is ignoring the is_utf8 flag and only sending the raw internal representation (latin1 for strA and utf8 for strB).

That said, this has surely been gone over before, so could someone please put me straight.

(Workaround 1 - avoid use of \x escapes and either use decode_utf8 to build string literals or the use utf8; pragma and literal utf8 sequences in your source.)

(Workaround 2 - if your strings come from an external source (e.g. Data::Dumper) you can do:

if (!Encode::is_utf8($str) && $str =~ m/[\x80-\xff]/) {
    $str = Encode::encode_utf8($str);   # Get utf8 byte seq
    Encode::_utf8_on($str);             # and flip the flag
}

But that's fugly since it assumes loads about internals and has a runtime cost which isn't always fun.)

Replies are listed 'Best First'.
Re: mod_perl2 and utf8
by ikegami (Patriarch) on Dec 23, 2009 at 17:59 UTC

    It all comes down to:

    You can't output characters. You can only output bytes. If you want to output characters, you'll need to encode them somehow.

    You didn't do that.

    If I have a mod_perl2 handler which just sends text/plain utf8 content and send those two strings via $r->print then I see different results in the browser (strA doesn't render correctly). Note that $r->binmode seems to do nothing.

    You've shown that $r->print expects a string of bytes, just like the builtin print. If you want to output characters, you need to encode them first, either manually or by telling the object to do it for you (such as by using the PerlIO layer :utf8 or :encoding on a file handle).

    The only reason $strB works is that $r->print does the best it can with an invalid input. You should get "Wide character" warnings alerting you to that fact.

    You said binmode doesn't work on $r, so that leaves you with the option of doing it manually.

    Fix:

    $r->print($strA); # XXX
    $r->print($strB); # XXX
    should be
    $r->print(Encode::encode_utf8($strA));
    $r->print(Encode::encode_utf8($strB));
    or
    utf8::encode(my $strA_utf8 = $strA);
    utf8::encode(my $strB_utf8 = $strB);
    $r->print($strA_utf8);
    $r->print($strB_utf8);

    Update: Adjusted phrasing

      I would expect $r->print to accept a string of bytes just like every other print.

      The posted code shows that printing to STDOUT in a cmdline script gives the same result for both strings (because non-mod-perl2 STDOUT has an associated encoding (latin1 by default, changeable to utf8) - either will work if it matches your terminal).

      Printing to STDOUT (or using $r->print) under apache does not have this property - you get different behaviour for the two approaches.

      i.e. the problem as I see it is that STDOUT under mod_perl2 lacks the utf8-awareness built through the rest of the perl I/O layer, with no way of enabling it.

      Yes, you can manually encode, but that can necessitate additional copying. For example, you can pass $r as an output destination to TT, which will do the wrong thing if you're working with unicode strings, since it won't call encode. Yes, you can build to a scalar, encode that and print that, but it's a shame the extra copy is needed when perl has a mechanism for this which isn't being used.
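      One way to keep the encoding step in a single place is a small wrapper object whose print method encodes and then delegates. This is only a sketch: EncodingPrinter and MockSink are hypothetical names invented here, and MockSink merely stands in for $r so the example runs without Apache. It assumes the consumer only needs an object with a print method.

```perl
use strict;
use warnings;

# Hypothetical wrapper: encodes each chunk to UTF-8 octets, then hands
# the octets to the wrapped object's own print method.
package EncodingPrinter;

use Encode ();

sub new {
    my ($class, $target) = @_;
    return bless { target => $target }, $class;
}

sub print {
    my ($self, @chunks) = @_;
    return $self->{target}->print(map { Encode::encode_utf8($_) } @chunks);
}

# Hypothetical stand-in for $r: collects whatever bytes it is given.
package MockSink;

sub new {
    my ($class) = @_;
    return bless { buf => '' }, $class;
}

sub print {
    my ($self, @chunks) = @_;
    $self->{buf} .= join '', @chunks;
    return 1;
}

package main;

my $sink = MockSink->new;
my $out  = EncodingPrinter->new($sink);

# Both ways of building the string now reach the sink as the same octets.
$out->print("Bob\x{B4}s files\n");
$out->print(Encode::decode_utf8("Bob\xC2\xB4s files\n"));
```

      Under mod_perl the wrapper would be built as EncodingPrinter->new($r) and passed wherever $r was passed before. It does not remove the extra copy complained about above; it only moves the encode call out of the calling code.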

        I would expect $r->print to accept a string of bytes just like every other print.

        The posted code shows that printing to STDOUT in a cmdline script gives the same result for both strings

        I followed up by saying you can instruct print to accept characters by telling it how to handle them. This is done on a per-handle basis, and that's what you did for STDOUT with

        binmode STDOUT, ':utf8';

        You need to do something equivalent with mod_perl's object.
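        The effect of a per-handle layer can be checked without Apache by writing to in-memory handles. This is a sketch only: bytes_written is a helper name made up for the example, and it shows how ordinary perl handles behave, not what $r->print does.

```perl
use strict;
use warnings;
use Encode ();

my $strA = "Bob\x{B4}s files";
my $strB = Encode::decode_utf8("Bob\xC2\xB4s files");

# Hypothetical helper: print $string to an in-memory handle opened with
# the given layer (may be empty) and return the bytes that were written.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

# With an encoding layer, both strings are written as UTF-8 octets.
my $utf8A = bytes_written(':encoding(UTF-8)', $strA);
my $utf8B = bytes_written(':encoding(UTF-8)', $strB);

# Without a layer, both strings are written as the latin1 byte 0xB4.
my $rawA = bytes_written('', $strA);
my $rawB = bytes_written('', $strB);

printf "layer: %s, no layer: %s\n",
    ($utf8A eq $utf8B ? 'same' : 'different'),
    ($rawA  eq $rawB  ? 'same' : 'different');
```

        Either way, a plain perl handle treats the two strings identically, which is the property being asked of $r->print.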

        the problem as I see it is that STDOUT under mod_perl2 lacks the utf8-awareness built through the rest of the perl I/O layer, with no way of enabling it.

        Not knowing anything about the class except what you've told me, I agree. File a bug report.

Re: mod_perl2 and utf8
by ikegami (Patriarch) on Dec 23, 2009 at 18:35 UTC

    By the way,

    if (!Encode::is_utf8($str) && $str =~ m/[\x80-\xff]/) {
        $str = Encode::encode_utf8($str);   # Get utf8 byte seq
        Encode::_utf8_on($str);             # and flip the flag
    }
    $r->print($str);
    is the same as
    utf8::upgrade($str);
    $r->print($str);

    But as previously explained, the correct solution is

    utf8::encode($str);
    $r->print($str);
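
    The difference between the two calls can be seen in a short sketch using only core utf8:: functions (the variable names are illustrative). utf8::upgrade changes how the string is stored but not what it contains; utf8::encode replaces the characters with their UTF-8 octets.

```perl
use strict;
use warnings;

my $original = "Bob\x{B4}s files";

# upgrade: same characters, different internal storage
my $upgraded = $original;
utf8::upgrade($upgraded);

# encode: the string now holds UTF-8 octets instead of characters
my $encoded = $original;
utf8::encode($encoded);

my $upgraded_same = ($upgraded eq $original) ? 1 : 0;
my $encoded_same  = ($encoded  eq $original) ? 1 : 0;

printf "upgraded: same=%d length=%d flag=%d\n",
    $upgraded_same, length($upgraded), (utf8::is_utf8($upgraded) ? 1 : 0);
printf "encoded:  same=%d length=%d flag=%d\n",
    $encoded_same, length($encoded), (utf8::is_utf8($encoded) ? 1 : 0);
```

    So the upgrade variant depends on the receiver writing out the internal buffer as-is, while the encode variant hands over an ordinary byte string.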
Re: mod_perl2 and utf8
by WizardOfUz (Friar) on Dec 23, 2009 at 19:07 UTC

    Setting an I/O layer should work. See:

    • mod_perl-2.0.4/t/response/TestModperl/print_utf8.pm
    • mod_perl-2.0.4/t/response/TestModperl/print_utf8_2.pm

    Did you run the corresponding tests?

    • mod_perl-2.0.4/t/modperl/print_utf8.t
    • mod_perl-2.0.4/t/modperl/print_utf8_2.t