PerlMonks  

Re: Is there some universal Unicode+UTF8 switch?

by daxim (Curate)
on Sep 01, 2019 at 17:17 UTC (#11105384)


in reply to Is there some universal Unicode+UTF8 switch?

Try utf8::all. It's not universal, because it handles only the core functionality, not libraries. Your use case can be much simplified, though. I strongly suspect you have too much code. Consider:
  • HTTP::Response provides both content (returns octets) and decoded_content (returns characters, decoded according to the charset in the Content-Type header).
  • decode_json wants to consume octets.
This means the following DWYW:
    use LWP::UserAgent qw();
    use JSON::MaybeXS qw(decode_json);
    my $ua = LWP::UserAgent->new;
    my $res = $ua->get('https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=allusers&auactiveusers&aufrom=%D0%91');
    die $res->status_line unless $res->is_success;
    my $json_OCTETS = $res->content;
    my $all_users_CHARACTERS = decode_json $json_OCTETS;
    my $continue_aufrom_CHARACTERS = $all_users_CHARACTERS->{continue}{aufrom};
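The octets-in, characters-out behaviour can be checked offline. The following is only a sketch: it assumes the core module JSON::PP as a stand-in for JSON::MaybeXS, and the JSON literal is a made-up sample, not real API output.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # assumption: core stand-in for JSON::MaybeXS

# Hypothetical response body as UTF-8 octets: U+0411 is the two bytes D0 91.
my $json_OCTETS = qq({"continue":{"aufrom":"\xD0\x91ob"}});

# decode_json consumes octets and returns Perl character strings.
my $data_CHARACTERS   = decode_json($json_OCTETS);
my $aufrom_CHARACTERS = $data_CHARACTERS->{continue}{aufrom};

printf "octets in the JSON value: %d\n", length("\xD0\x91ob");          # 4
printf "characters after decode:  %d\n", length($aufrom_CHARACTERS);    # 3
printf "first code point:         U+%04X\n", ord($aufrom_CHARACTERS);   # U+0411
```

Four octets become three characters, so no further decoding step is needed on the result.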
Your CGI script's templating system should take care to produce UTF-8 encoded octets. If you don't have one, then either one of
  • use Encode qw(encode);
    my $continue_aufrom_OCTETS = encode('UTF-8', $continue_aufrom_CHARACTERS, Encode::FB_CROAK);
    STDOUT->print($continue_aufrom_OCTETS);
  • binmode STDOUT, ':encoding(UTF-8)';
    STDOUT->print($continue_aufrom_CHARACTERS);
is appropriate. The first variant is more robust.
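Both variants should put the same bytes on the wire. Here is a rough sketch to confirm that, with an in-memory filehandle standing in for STDOUT and a made-up sample string. The LEAVE_SRC flag is an addition not in the post above: per the Encode documentation, a CHECK value without it may modify the source string in place.

```perl
use strict;
use warnings;
use Encode qw(encode);
use IO::Handle;

my $text_CHARACTERS = "\x{0411}ob";   # hypothetical sample value

# Variant 1: encode explicitly. FB_CROAK dies on unencodable input;
# LEAVE_SRC (an addition here) keeps the source string untouched.
my $text_OCTETS = encode('UTF-8', $text_CHARACTERS,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Variant 2: let an I/O layer encode on output.
# The in-memory handle is a stand-in for STDOUT.
open my $fh, '>:encoding(UTF-8)', \my $buffer_OCTETS or die "open: $!";
$fh->print($text_CHARACTERS);
close $fh or die "close: $!";

print "same bytes\n" if $text_OCTETS eq $buffer_OCTETS;
printf "%vX\n", $text_OCTETS;   # D0.91.6F.62
```

With the explicit variant, a string that cannot be encoded raises an exception at the encode call, where it can be caught with eval.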

Replies are listed 'Best First'.
Re^2: Is there some universal Unicode+UTF8 switch?
by VK (Novice) on Sep 01, 2019 at 19:39 UTC

    use utf8::all; sounds the most promising, thank you and I'll try it. The reason I didn't use it yet is that the main doc https://perldoc.perl.org/utf8.html doesn't have a single mention of this option - so either you know about utf8::all in advance, or you are out of luck.

    The JSON function shortcut decode_json has UTF-8 decoding hardcoded to "on". To turn it "off" and to avoid double encoding I had to use the full call, like JSON->new->utf8(0)->decode($response->content). If utf8::all solves this problem as well, then I can use the function shortcut. I will check everything later today.

    (Update) Nope, I rechecked: only the current long code works reliably for non-ASCII. For the sample URL above I do a my $response = LWP call and then

    1. my $data1 = JSON->new->utf8(0)->decode($response->content);
    2. my $data2 = decode_json($response->content);
    3. my $data3 = $response->decoded_content;
    and then my $test = $data1->{query}->{allusers}[0]->{name};

    1) always works for my needs. 2) works if called in some obvious scalar context. One tries slice referenced array or anything complex - it falls to the "Perl branded jam" with — – and the like. 3) is reliably DOA (dead on arrival), so the same as 2) but right away.

    So utf8::all should be written and extended to some utf8::all_throughout. "Written" means to the reliability and stability level needed to be included in prominent Perl distributions. Until then the answer to my initial question seems negative.
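The difference between variants 1 and 2 above can be reproduced without a network call. This is only a sketch: it assumes JSON::PP as the backend rather than whatever the JSON module selects, and it uses a made-up response body with a one-letter user name.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical response body as UTF-8 octets; the name is U+0411 (bytes D0 91).
my $body_OCTETS = qq({"query":{"allusers":[{"name":"\xD0\x91"}]}});

# Variant 1: utf8(0) treats the input as already-decoded characters,
# so each byte of the name ends up as its own character.
my $data1 = JSON::PP->new->utf8(0)->decode($body_OCTETS);

# Variant 2: utf8(1) is the setting decode_json uses; the octets are decoded.
my $data2 = JSON::PP->new->utf8(1)->decode($body_OCTETS);

my $name1 = $data1->{query}{allusers}[0]{name};
my $name2 = $data2->{query}{allusers}[0]{name};

printf "variant 1: length %d, code points %vX\n", length($name1), $name1;   # 2, D0.91
printf "variant 2: length %d, code points %vX\n", length($name2), $name2;   # 1, 411
```

Variant 1 still holds the original bytes, so it looks right when printed to a handle with no encoding layer. Variant 2 holds real characters and needs to be encoded on output, as in the parent post.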

      slice referenced array or anything complex - it falls to the "Perl branded jam"
      I'm sceptical about that claim. Show your code.
Re^2: Is there some universal Unicode+UTF8 switch?
by Anonymous Monk on Sep 02, 2019 at 11:29 UTC
    Just stylin:
    STDOUT->binmode(':encoding(UTF-8)'); STDOUT->print($continue_aufrom_CHARACTERS);
