PerlMonks  

encoding for $!

by powerman (Friar)
on Jun 28, 2014 at 20:57 UTC ( [id://1091565]=perlquestion )

powerman has asked for the wisdom of the Perl Monks concerning the following question:

$! contains the error message in the locale's language and encoding as bytes; it isn't decoded into a Unicode scalar:

$ LANG=ru_RU.UTF-8 perl -CS -MEncode=decode -E '
    sub show {
        printf "is_utf8=%d, length=%d, msg=%s\n", utf8::is_utf8($_), length, $_;
    }
    $_=$!=111; show();
    $_=decode("UTF-8", $_); show();
'
is_utf8=0, length=40, msg=Ð Ñоединении оÑказано
is_utf8=1, length=21, msg=[the correct Russian "connection refused" text should appear here, but the PerlMonks parser turns it into HTML entities]

So, when we have a nice Unicode-enabled application that correctly writes Unicode messages into UTF-8 log files, and then we get some error and try to write $! into the log, we get junk instead of the error message.

It's clear that it's nearly impossible to manually decode('UTF-8') in each and every place we get $!, especially because the error may happen inside a 3rd-party module. So, is it possible to somehow fix or work around this issue without forcing LANG=en_US.UTF-8?

Replies are listed 'Best First'.
Re: encoding for $!
by Anonymous Monk on Jun 28, 2014 at 22:41 UTC

    Start your search here: errno, errno -> Errno::AnyString - put arbitrary strings in $!

    magic, magic -> Variable::Magic - Associate user-defined magic to variables from Perl.

    You'll realize that these magic variables look a lot like tied variables. It is not surprising, as tied variables are implemented as a special kind of magic, just like any 'irregular' Perl variable : scalars like $!, $( or $^W, the %ENV and %SIG hashes, the @ISA array, vec() and substr() lvalues, threads::shared variables... They all share the same underlying C API, and this module gives you direct access to it.

    > It's nearly impossible to manually decode('UTF-8') in each and every place we get $!

    Should have used an abstraction :)  errno(), "@{[errno()]}",  @{[int errno()]}
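
    The abstraction suggested here could look something like the sketch below. The errno_text() name is hypothetical, not part of any module. It assumes the message bytes are in the codeset that I18N::Langinfo reports for the current locale, and it passes through strings that are pure ASCII or already carry the UTF-8 flag (as newer perls may return).

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_ALL);

# Adopt the locale from the environment so that langinfo(CODESET)
# reports the encoding the error text is actually in.
setlocale(LC_ALL, "");

# Hypothetical helper: return the text for an errno value as a
# character string. With no argument it uses the current $!.
sub errno_text {
    my $errno = @_ ? shift : $! + 0;   # capture the number first
    local $! = $errno;
    my $msg = "$!";

    return $msg if utf8::is_utf8($msg);          # already decoded
    return $msg unless $msg =~ /[^\x00-\x7F]/;   # plain ASCII, nothing to do

    # Assumption: the bytes are in the locale's codeset.
    my $codeset = langinfo(CODESET);
    return $msg unless find_encoding($codeset);  # unknown to Encode: give up
    return decode($codeset, $msg);
}
```

    A call site would then write something like die "open $file failed: " . errno_text() instead of interpolating $! directly, which of course does not help with errors reported from inside 3rd-party modules.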

      Thanks for the pointers, it was fun to play, but neither helps.

      With Errno::AnyString I'm able to change the string value of $!, but it looks like it doesn't matter how I set it (as a Unicode scalar or as a binary UTF-8-encoded string), because it is always fetched as a binary UTF-8-encoded string.

      Variable::Magic doesn't actually see when $! is set on errors, and even on a manual set it sees (and can fake on get) only the numeric value.

      I'm now on perl-5.16.3, so I didn't see this in the docs before, but it looks like perldoc perlvar for 5.20 mentions this issue: «Note that when stringified, the text is always returned as if both "use locale" and "use bytes" are in effect. This is likely to change in v5.22.»

Re: encoding for $!
by Anonymous Monk on Jun 29, 2014 at 14:41 UTC

    For future reference, the perldelta file for Perl 5.21.1 says:

    "$!" text is now in English outside "use locale" scope
    Previously, the text, unlike almost everything else, always came out based on the current underlying locale of the program. (Also affected on some systems is "$^E".) For programs that are unprepared to handle locale, this can cause garbage text to be displayed. It's better to display text that is translatable via some tool than garbage text which is much harder to figure out.

    "$!" text will be returned in UTF-8 when appropriate
    The stringification of $! and $^E will have the UTF-8 flag set when the text is actually non-ASCII UTF-8. This will enable programs that are set up to be locale-aware to properly output messages in the user's native language. Code that needs to continue the 5.20 and earlier behavior can do the stringification within the scopes of both 'use bytes' and 'use locale ":messages"'. No other Perl operations will be affected by locale; only $! and $^E stringification. The 'bytes' pragma causes the UTF-8 flag to not be set, just as in previous Perl releases. This resolves perl #112208.
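
    Going by the perldelta text quoted above, a program that wants localized error text would stringify $! inside a "use locale" scope, and would get English text outside it. A minimal sketch of the two cases, assuming Perl 5.22 or later and a locale taken from the environment; the sub names are hypothetical, and no output is shown because it depends on the perl version and the locales installed:

```perl
use strict;
use warnings;
use Errno qw(ENOENT);
use POSIX qw(setlocale LC_ALL);

# Pick up the user's locale from the environment.
setlocale(LC_ALL, "");

# Stringify inside "use locale": per the quoted perldelta, 5.22+ should
# return the locale's message, UTF-8 flagged when it is non-ASCII UTF-8.
sub localized_errno {
    my ($errno) = @_;
    use locale;              # lexically scoped to this sub body
    local $! = $errno;
    return "$!";
}

# Stringify outside "use locale": per the quoted perldelta, 5.22+ should
# return English text here; 5.20 and earlier return locale bytes.
sub plain_errno {
    my ($errno) = @_;
    local $! = $errno;
    return "$!";
}

printf "localized: is_utf8=%d, length=%d\n",
    utf8::is_utf8(localized_errno(ENOENT)) ? 1 : 0,
    length localized_errno(ENOENT);
printf "plain:     %s\n", plain_errno(ENOENT);
```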

    This looks to me like a good news/bad news situation. The good news is that help is on the way. The bad news is that it is a year away, and whatever you do to ameliorate the situation now will have to be redone when 5.22 comes out. Plus you will have to actually upgrade to 5.22 when it gets here.

Re: encoding for $!
by remiah (Hermit) on Jun 29, 2014 at 05:40 UTC

    Hello powerman, I also would like to know the solution for this...

    $! is often used in an interpolated string, like die "open $file failed $!", so I would like to hook the decode step with a tied variable, a command-line option, or, hopefully, something similar to an IO layer.

    I tried with tie. This seems to work. So, I guess changing every $! variable comes down to adding 2 lines to the head of the script.

    use DecodedMsg; tie $! , "DecodedMsg";
    Below are my tests. But I have to confess, I am not familiar with these "tie" things. Comments are welcome.

    use strict;
    use warnings;
    use Encode qw(decode encode);

    sub show {
        printf "is_utf8=%d, len=%d, msg=%s\n", utf8::is_utf8($_), length($_), $_;
    }

    # byte string: UTF-8 encoded hiragana
    my $byte_error_msg = encode('UTF-8', join('', map { pack('U', $_) } 0x3042 .. 0x3052));

    print "#test1: with bytes\n";
    $_ = $byte_error_msg;
    show;

    print "#test2: with decoded char\n";
    $_ = decode("UTF-8", $byte_error_msg);
    show;

    print "#test3 with tied variable ... seems working\n";
    {
        package DecodedMsg;
        use Encode qw(decode);

        sub TIESCALAR {
            my $class = shift;
            print "class=$class\n";
            return bless({}, $class);
        }

        sub STORE {
            my $self = shift;
            print "###in store arg=$_[0]\n";
            $self->{val} = $_[0];
        }

        # Return the stored value, decoding it first if it is still bytes.
        sub FETCH {
            my $self = shift;
            if (utf8::is_utf8($self->{val})) {
                return $self->{val};
            }
            else {
                print "#in fetch ... return decoded\n";
                return decode('UTF-8', $self->{val});
            }
        }
        1;
    }

    tie $_, "DecodedMsg";    # tie $_ to DecodedMsg
    $_ = $byte_error_msg;
    show;

    print "#test4 with tied variable ... with system error\n";
    tie $!, "DecodedMsg";
    open(my $fh, "<", " /the/path/to/nowhere");
    $_ = $!;
    show;

      My point was to find a solution which doesn't require fixing $! everywhere we get it. With a tied $_ we would have to copy $! into $_ everywhere we get it, which doesn't really differ from doing decode('UTF-8',$!) in place.

      It looks like the best solution for now is to force "en_US.UTF-8" (at least this can be done once at the beginning of the script) or to wait until perl-5.22 (which may fix this issue).

      use POSIX qw(locale_h); BEGIN { setlocale(LC_MESSAGES,"en_US.UTF-8") }
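
      A slightly more defensive variant of the line above is sketched here. It assumes that "en_US.UTF-8" may not be installed on every machine and falls back to the "C" locale, which is always available, when setlocale() reports failure by returning a false value. Only the message category is changed, so other locale-dependent behavior stays as it was.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_MESSAGES);

BEGIN {
    # Force English error text once, at program start.
    # setlocale() returns undef if the requested locale is unavailable,
    # in which case we fall back to the always-present "C" locale.
    setlocale(LC_MESSAGES, "en_US.UTF-8")
        or setlocale(LC_MESSAGES, "C");
}

# Any later stringification of $! should now be plain English text.
$! = 2;
print "error text: $!\n";
```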

        sorry, mistyped.

        use DecodedMsg; tie $! , "DecodedMsg";
        See the test4 example. But I also think setting LANG to en_US.UTF-8 is the better solution. Regards.

Re: encoding for $!
by LanX (Saint) on Jun 28, 2014 at 21:21 UTC
    > It's clear what it's nearly impossible to manually decode('UTF-8') in each and every place we get $!

    Please elaborate why!

    If $! is logged into a file then there must be a routine in the middle handling this.

    So why can't you translate at this point?

    Cheers Rolf

    (addicted to the Perl Programming Language)

      I can and I do. But this routine logs not only $! but any message, and thus expects _correct_ scalars, i.e. if a message contains non-ASCII chars then it must be a Unicode string (utf8::is_utf8 should return true).

      At every point where the app receives text from the outside world (stdin, @ARGV, CGI params, network, database) it decodes from bytes to Unicode using a known encoding. At every point where the app outputs text to the outside world (stdout, files, logs, network, etc.) it encodes Unicode into bytes using a known encoding. Between input and output, all text in scalars is Unicode. Now we get $! somewhere in the middle of the app, and it turns out this is one more "receive from the outside world" point where we should decode bytes to Unicode, but doing this everywhere we may get $! is too inconvenient.

        As was pointed out by an anonymous monk on Jun 29, 2014, Perl has a UTF-8 flag you can test (see perlunicode and related).

        Since you know the locale your program is running with, for any message your central logging routine receives that is not flagged as UTF-8, you could then convert it.

        (not tested, so not sure if this would work)
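
        A minimal sketch of that idea follows. The to_characters() helper is hypothetical, and the default of 'UTF-8' is an assumption that stands in for "the locale your program is running with". Messages that are already flagged, or that are pure ASCII, pass through unchanged; anything else is decoded, and left alone if it is not valid in the assumed encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper for a central logging routine: make sure a
# message is a character string before it is encoded for output.
sub to_characters {
    my ($msg, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;   # assumed locale encoding

    return $msg if utf8::is_utf8($msg);             # already characters
    return $msg unless $msg =~ /[^\x00-\x7F]/;      # ASCII needs no decoding

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $msg intact so it can be returned unchanged in that case.
    my $decoded = eval { decode($encoding, $msg, FB_CROAK | LEAVE_SRC) };
    return defined $decoded ? $decoded : $msg;
}

# Usage: six Cyrillic characters, encoded to twelve UTF-8 bytes,
# as an undecoded $! string would arrive.
my $bytes = encode('UTF-8', "\x{43e}\x{448}\x{438}\x{431}\x{43a}\x{430}");
my $text  = to_characters($bytes);
printf "is_utf8=%d, length=%d\n", utf8::is_utf8($text) ? 1 : 0, length $text;
# prints: is_utf8=1, length=6
```

        The weak spot is an unflagged Latin-1 string whose high bytes happen to form valid UTF-8: it would be decoded by mistake. In an application that decodes all input at the boundaries, as described above, such strings should be rare.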

Re: encoding for $!
by Jim (Curate) on Jun 29, 2014 at 04:51 UTC

    What about the open pragma? Can you not use it to control how Russian language (Cyrillic script) error messages are decoded?

Node Type: perlquestion [id://1091565]
Approved by farang
Front-paged by LanX