PerlMonks  

JSON::XS and unicode

by kimmel (Scribe)
on Sep 08, 2012 at 19:16 UTC ( [id://992521]=perlquestion )

kimmel has asked for the wisdom of the Perl Monks concerning the following question:

I have run into what I believe is a bug when passing Unicode text to decode_json. I am using JSON::XS 2.33, which is the latest version. I also checked the programs below and the source files I was using before with file and isutf8 (from moreutils), both of which report that everything is proper UTF-8 Unicode text. Here is the broken code:

    #!/usr/bin/perl
    use v5.14;
    use warnings;
    use utf8::all;
    use JSON::XS qw( decode_json );
    use Data::Dumper;

    my $wl = '{"creche": "crèche", "¥": "£", "₡": "волн" }';
    my $pattern_list = decode_json( $wl );
    print Dumper $pattern_list;

This generates the following message: 'Wide character in subroutine entry at ./micro_test.pl line 22.' I tried switching the decode_json call to my $pattern_list = JSON::XS->new->utf8->decode( $wl ); but I got the same error message. However, if I save that data to a file and then load the file, I get no error message.

    #!/usr/bin/perl
    use v5.14;
    use warnings;
    use utf8::all;
    use JSON::XS qw( decode_json );
    use File::Slurp qw( read_file );
    use Data::Dumper;

    my $wl = '{"creche": "crèche", "¥": "£", "₡": "волн" }';
    open my $fh, '>', 'test_file2';
    say {$fh} $wl;
    close $fh;

    my $pattern_list = decode_json( read_file('test_file2') );
    print Dumper $pattern_list;

Is there a step missing from the first program that I am unaware of? I expected the first program to just work since the POD for JSON::XS states that decode_json expects UTF-8. I also tried JSON::PP and it gave me different errors.

    #!/usr/bin/perl
    use v5.14;
    use warnings;
    use utf8::all;
    use JSON::PP qw( decode_json );
    use File::Slurp qw( read_file );
    use Data::Dumper;

    my $wl = '{"creche": "crèche", "¥": "£", "₡": "волн" }';
    open my $fh, '>', 't2';
    say {$fh} $wl;
    close $fh;

    my $pattern_list = decode_json( read_file('t2') );
    print Dumper $pattern_list;

, or } expected while parsing object/hash, at character offset 21 (before "\n") at ./micro_test.pl line 28

Replies are listed 'Best First'.
Re: JSON::XS and unicode
by chip (Curate) on Sep 08, 2012 at 20:57 UTC
    As is usual with Marc Lehmann's documentation, you must read it very carefully and take it literally. Note that while Perl's test for the Unicode bit is named "is_utf8()", the JSON::XS meaning of "utf8" is more correct -- it uses the term to refer to *bytes* that follow the UTF-8 encoding rules. And decode_json has it on by default.

    Try turning it *off*. And, not to be snarky, but read the docs very carefully. You just have to.

        -- Chip Salzenberg, Free-Floating Agent of Chaos
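The distinction between characters and UTF-8 *bytes* can be seen with the core Encode module alone. This is only a minimal sketch; the variable names are made up for illustration, and the sample word is the "волн" from the question, written with escapes so the script does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# "волн" as a character string: four Cyrillic code points.
my $chars = "\x{432}\x{43e}\x{43b}\x{43d}";

# encode() turns the characters into UTF-8 bytes (two bytes per letter here).
my $bytes = encode( 'UTF-8', $chars );

# decode() goes the other way, from bytes back to characters.
my $round_trip = decode( 'UTF-8', $bytes );

printf "characters: length %d, is_utf8 %s\n",
    length($chars), utf8::is_utf8($chars) ? 'yes' : 'no';
printf "bytes:      length %d, is_utf8 %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'yes' : 'no';
```

The byte string is the kind of input the "utf8" setting in JSON::XS is talking about; the character string is not.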

      Re-reading JSON::XS there was one key word I missed the first time around. Here is the relevant POD with my emphasis added.

      $perl_scalar = decode_json $json_text

      The opposite of encode_json: expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text, returning the resulting reference. Croaks on error.
      So I need to encode before passing it to decode_json(). Here is the working program:

          #!/usr/bin/perl
          use v5.14;
          use warnings;
          use utf8::all;
          use Encode;
          use JSON::XS qw( decode_json );

          my $wl = '{"creche": "crèche", "¥": "£", "₡": "волн" }';
          my $pattern_list = decode_json( encode("utf8", $wl) );
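For comparison, here is a small sketch of the two ways a character string can be handed to the decoder: encode it to bytes first, or switch the utf8 setting off. It uses the core JSON::PP module as a stand-in so that it runs without extra installs; that swap is an assumption of this sketch, and the same three method calls exist in JSON::XS. The file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                       # the literal below is read as characters
use Encode qw( encode );
use JSON::PP ();                # core stand-in for JSON::XS (assumption)

my $wl = '{"creche": "crèche", "¥": "£", "₡": "волн" }';

# Option 1: encode the characters to UTF-8 bytes, then decode with utf8 on.
my $from_bytes = JSON::PP->new->utf8->decode( encode( 'UTF-8', $wl ) );

# Option 2: keep the character string and switch utf8 off.
my $from_chars = JSON::PP->new->utf8(0)->decode( $wl );

# Handing characters to a utf8-enabled decoder is the case that croaks.
my $failed = eval { JSON::PP->new->utf8->decode( $wl ); 1 } ? 0 : 1;

printf "keys via bytes: %d, keys via characters: %d, mixed call failed: %d\n",
    scalar( keys %$from_bytes ), scalar( keys %$from_chars ), $failed;
```

Either option yields the same data structure with character strings as values; which one fits depends on whether the JSON arrives as bytes (file, socket) or already as characters.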

      Hello.
      There seem to be two problems.

      One is that JSON::XS expects an encoded UTF-8 string (bytes) by default, as you point out.
      The second is that utf8::all doesn't affect File::Slurp's I/O layer. When the OP writes the data to a file and reads it back with File::Slurp's read_file, the result is encoded UTF-8 (bytes), not a decoded character string. That is why the second example appears to succeed.

          use strict;
          use warnings;
          use JSON::XS qw( decode_json );
          use Data::Dumper;
          binmode(STDOUT,":encoding(UTF-8)");

          sub _p{return pack('U',$_[0])};
          my ($wl,$pattern_list);

          #create a decoded (perl internal utf8) JSON character string.
          $wl = '{"creche": "cr'._p(0xE8).'che",';
          $wl.= '"'._p(0xA5).'" : "'._p(0xA3).'",';
          $wl.= '"'._p(8353).'": "'._p(1074)._p(1086)._p(1083)._p(1085).'"';
          $wl.= '}';

          #example 1 of OP
          sub ex1 {
              my $pattern_list;
              #$pattern_list = decode_json($wl); #Wide character in subroutine entry
              #$pattern_list = JSON::XS->new->utf8(1)->decode($wl); #Wide character in subroutine entry
              #no warning: decode_json expects encoded utf8, not decoded utf8, by default
              $pattern_list = JSON::XS->new->utf8(0)->decode($wl);
          }
          #ex1();
          #print Dumper $pattern_list;

          #example 2
          sub ex2 {
              use File::Slurp qw( read_file );
              use utf8::all;
              open my $fh, '>:encoding(UTF-8)', 'test_file2';
              print {$fh} $wl;
              close $fh;
              #here utf8::all failed to set Slurp's io layer
              my $buffer= read_file('test_file2');
              print utf8::is_utf8($buffer) ? "buffer:utf8 flagged\n" : "buffer:not utf8 flagged\n";
              #you get encoded utf8 bytes, and that is what decode_json expects
              $pattern_list = decode_json( $buffer);
              #note: $pattern_list is a hash reference, so this reports on the reference itself
              print utf8::is_utf8($pattern_list) ? "pattern:utf8 flagged\n" : "pattern:not utf8 flagged\n";
          }
          ex2();
          print Dumper $pattern_list;

      JSON::XS's utf8 option seems to me very different from those of other modules, such as the DBD:: modules or Template's binmode option.
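The file round trip can be reproduced with core modules only. The sketch below is an illustration under a few assumptions: it uses File::Temp and plain open() in place of File::Slurp, JSON::PP in place of JSON::XS, and a source file saved as UTF-8. It reads the same file once as raw bytes and once through a decoding layer, then gives each result to the decoder setting that matches it.

```perl
use strict;
use warnings;
use utf8;                       # the literal below is read as characters
use File::Temp qw( tempfile );
use JSON::PP qw( decode_json ); # core stand-in for JSON::XS (assumption)

my $wl = '{"creche": "crèche", "¥": "£", "₡": "волн" }';

# Write the characters out as UTF-8 bytes.
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out, ':encoding(UTF-8)';
print {$out} $wl;
close $out or die "Cannot close $path: $!";

# Raw read: the scalar holds the bytes exactly as stored on disk.
open my $raw, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Decoding read: the I/O layer turns the bytes back into characters.
open my $dec, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $chars = do { local $/; <$dec> };
close $dec;

# Bytes go to decode_json (utf8 on); characters need utf8 switched off.
my $from_bytes = decode_json($bytes);
my $from_chars = JSON::PP->new->utf8(0)->decode($chars);

printf "raw read: %d bytes, decoded read: %d characters\n",
    length($bytes), length($chars);
```

A reader that applies no decoding layer, as described above for read_file, corresponds to the raw branch, which is why the second program in the question works without an explicit encode step.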

Re: JSON::XS and unicode
by philiprbrenan (Monk) on Sep 08, 2012 at 20:02 UTC

    You are trying to write a unicode encoded character to a file but the file has not been opened with an encoding that can cope. You need something like this (untested):

    {
        open(my $F, ">:encoding(UTF-8)", $f) or die "Cannot open $f";
        say {$F} $s;
    }

      I am using utf8::all which already takes care of turning on the different unicode parts.

      The errors do not come from reading or writing a file. The error is caused by using decode_json() on a string which should work.

        Thanks for the pointer to utf8::all, it'll be handy.

            -- Chip Salzenberg, Free-Floating Agent of Chaos

        Could you replace the characters in the input to JSON with a's one by one until the offending wide character disappears? We will then know exactly which input character is causing the problem. Thanks.

Re: JSON::XS and unicode
by fluffyvoidwarrior (Monk) on Sep 10, 2012 at 10:52 UTC
    This may not be a very popular viewpoint but I would question the security wisdom of using JSON as a vehicle to transport code objects between client side javascript and server side perl. I think it unwise primarily because the json data cannot be properly washed of potential exploit code without breaking the javascript object functionality. It is relatively easy to write your own json equivalent transport system (in Perl of course) using secure objects that maintain data typing and rebuild these on the client after XMLHttpRequest exchanges. If you're creating a web app you MUST MUST MUST wash ALL input or you WILL get screwed over....sooner or later. Just my opinion, it would be interesting to hear others.

      Neither JSON::XS nor the native JSON parsers in JavaScript execute JavaScript code. They all (should) parse the text and reconstruct the data structure using a JSON parser, not the JavaScript eval statement, exactly for the reason of not allowing easy JavaScript code execution within the page context.

      The web application should support that by sending the appropriate content type - which is application/json, at least according to RFC 4627.

      Using JSONP sacrifices that security for the convenience of circumventing the same origin policy.
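On the Perl side, the point that a JSON parser builds data rather than running code can be checked with a short sketch. It uses the core JSON::PP module (an assumption made so the example runs without extra installs), and the sample inputs are invented for illustration.

```perl
use strict;
use warnings;
use JSON::PP qw( decode_json ); # core stand-in for JSON::XS (assumption)

# A JSON text whose string value happens to look like code.
my $json = '{"note": "system(\"ls\")"}';
my $data = decode_json($json);

# The value comes back as an ordinary string; nothing was executed.
print "note is a plain string: $data->{note}\n";

# Input that is code rather than JSON is rejected by the parser.
my $accepted = eval { decode_json('alert(1)'); 1 } ? 1 : 0;
print "non-JSON input accepted: $accepted\n";
```

Whether the application later does something unsafe with such a string is a separate question from what the parser itself does.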

        Indeed, yes. No-one in their right mind would actively and intentionally execute passed code. Maybe I'm paranoid, but I won't have executable code coming in from the network under any circumstances. Before I'll accept it, it must be not capable of being executed. Accepting executable code and then trusting that it will never be called... erm... Just seems to me that if you were going to attack a system you would look very closely at JSON and related parsing mechanisms because code is, by definition, already accepted. You're already half way there. In fact you have been supplied with a ready built framework for injecting your malice. After all, if it isn't runnable code it isn't JSON and PHP plus JSON seems like a perfect storm. It's just not likely to be failsafe. At least Perl can be made failsafe.
