JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for characters in range 127 to 255

by Ovid (Cardinal)
on Dec 06, 2014 at 20:31 UTC

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Update: PerlMonks seems to have trouble rendering some of this. The question is also posted on Stack Overflow.

I'm getting some corrupted JSON and I've reduced it down to this test case.

use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;

BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');
done_testing;

And this is the output:

utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
#   Failed test 'JSON: {"value":"Deliver «French Bread»"}'
#   at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ—¥æœ¬å›½"}
1..4
{"value":"日本国"}
# 187

So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The `utf8` pragma is correctly marking my source. Further, that trailing 187 is from the diag. It's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap; I never could quite get that right with Test::Builder.)

Switching to `JSON::PP` produces the same output.

Further testing reveals the same failure for every character in range 127 to 255.
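
A minimal sketch of that sweep (my reconstruction, not the original test script; note that 127 itself, DEL, still encodes to a single byte, so the two-byte sequences begin at 128):

    use strict;
    use warnings;
    use Test::More;
    use Test::utf8;
    use JSON::XS;

    # Every one of these deliberately fails, reproducing the report:
    # encode_json returns UTF-8 octets, so each code point above 127
    # becomes a two-byte sequence that is_sane_utf8 flags as "dodgy".
    for my $cp ( 128 .. 255 ) {
        my $json = encode_json( { value => chr($cp) } );
        is_sane_utf8 $json, sprintf 'JSON for U+%04X', $cp;
    }
    done_testing;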

This is Perl 5.18.1 running on OS X Yosemite.


Replies are listed 'Best First'.
Re: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for characters in range 127 to 255
by Corion (Patriarch) on Dec 06, 2014 at 20:41 UTC

    The following program passes its tests for me, side-stepping the source code encoding:

    use utf8;
    use 5.16.0;
    use Test::More;
    use Test::utf8;
    use JSON::XS;

    BEGIN {
        # damn it
        my $builder = Test::Builder->new;
        foreach (qw/output failure_output todo_output/) {
            binmode $builder->$_, ':encoding(UTF-8)';
        }
    }

    foreach my $string (
        "Deliver \N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}French Bread\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}",
        '日本国',
    ) {
        my $hashref = { value => $string };
        is_sane_utf8 $string, "String: $string";
        my $json = encode_json($hashref);
        is_sane_utf8 $json, "JSON: $json";
        say STDERR $json;
    }
    diag ord('»');
    done_testing;

    On the Windows command line, I also get 187 from perl -wle "print ord('»')"... I'm not sure what to make of this, but I feel confirmed in my intention to have no characters above code point 127 in my source code...

    Thanks for introducing me to Test::utf8 - this feels useful when diagnosing mojibake issues.
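
    For anyone else meeting Test::utf8 for the first time, here is a quick sketch of the distinctions it draws (the assertion names come from its documentation; the example strings are mine):

        use Test::More;
        use Test::utf8;
        use Encode qw(decode);

        my $octets = "Caf\xC3\xA9";               # raw UTF-8 bytes, not yet decoded
        my $chars  = decode( 'UTF-8', $octets );  # a proper Perl character string

        is_sane_utf8    $chars,  'decoded string shows no double-encoding';
        is_flagged_utf8 $chars,  'decoded string carries the UTF8 flag';
        is_within_ascii 'plain', 'pure ASCII stays within ASCII';

        done_testing;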

Re: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for characters in range 127 to 255
by BrowserUk (Patriarch) on Dec 07, 2014 at 03:19 UTC

    This post, and threads like it, make me wonder when programmers will wake up to the fact that unicode is one monumental cockup; and that until programmers collectively start rejecting it, rather than suffering it, they are enabling its continuance rather than promoting the search for a cure.


      "...unicode is a one monumental cockup...start rejecting it...search for a cure."

      Dear BrowserUk,

      i'm in agreement with you when i remember how often i've bothered myself with unicode-related stuff.

      But what would be the cure?

      Best regards, Karl

      P.S.: And i really assume that your post isn't just a very sophisticated joke that i didn't catch.


        But what would be the cure?
        1. Stop pretending that unicode is 'forwards compatible' with ASCII.

          The least useful property of unicode is that a trivial subset of it can appear to be 'simple text'.

        2. Stop pretending that unicode isn't a binary format.

          Every other binary format in common use self-identifies through the use of 'signatures', e.g. "GIF87a" and "GIF89a".

        3. Recognise that unicode isn't a single format, but many formats all lumped together in a confused and confusing mess.

          Some parts have several names, some of which are deprecated. Other associated terms have meant, and in some cases still do mean, two or more different things.

        4. Recognise that there is no need and no real benefit to the "clever" variable length encoding used by some of the formats.

          It creates far more problems than it fixes; and is the archetypal 'premature optimisation' that has long since outlived its benefit or purpose.

        5. Keep the good stuff -- the identification and standardisation of glyphs, graphemes and code points -- and rationalise the formats to a single, fixed-width, self-identifying format.

          Just imagine how much simpler, safer, and more efficient it would be if you could read the first few bytes of a file and *know* what it contains.

          Imagine how much more efficient it would be if, to read the 10 characters starting at the 1,073,741,823rd character of a file, you could simply do (say):

          seek FH, 1073741823 * 3 + SIG_SIZE, 0;   # a fixed 3 bytes per code point
          read( FH, $in, 10 * 3 );

          Instead of having to a) guess the encoding, and then b) read all the bytes from the beginning, counting characters as you go (see the sketch at the end of this list).

          Imagine all the other examples of stupid guesswork and inefficiency that I could have used.

          Imagine not having to deal with any of them.

        Imagine that programmers said "enough is enough"; give us a simple, single, sane, self-describing format for encoding the world's data.
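
        To make the contrast concrete, here is the sketch promised above: what a variable-length encoding forces on you (hypothetical file name; the chunked loop just avoids slurping a gigabyte of characters at once):

          use strict;
          use warnings;

          # With UTF-8 there is no arithmetic that turns "character
          # 1073741823" into a byte offset, so the only general answer
          # is to decode and count from byte 0.
          open my $fh, '<:encoding(UTF-8)', 'big.txt' or die $!;

          my $left = 1_073_741_823;
          while ( $left > 0 ) {
              my $want = $left < 1_000_000 ? $left : 1_000_000;
              my $got  = read( $fh, my $discard, $want )
                  or die "file too short";
              $left -= $got;
          }
          read( $fh, my $wanted, 10 );   # the ten characters we came for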


Re: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for characters in range 127 to 255
by basiliscos (Pilgrim) on Dec 06, 2014 at 20:59 UTC

    The provided test sample passes on my Perl v5.20.1 / Linux.

    WBR, basiliscos.
Re: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for characters in range 127 to 255
by Anonymous Monk on Dec 06, 2014 at 21:04 UTC
    Ha! What would you say now, ikegami? When people like James Keenan and Ovid don't understand how this stuff works... can less experienced programmers even hope to ever get this right?
    So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The `utf8` pragma is correctly marking my source.
    JSON::XS says:
    (encode_json) Converts the given Perl data structure to a UTF-8 encoded, binary string (that is, the string contains octets only). Croaks on error.
    Test::utf8 says:
    (is_sane_utf8) This test fails if the string contains something that looks like it might be dodgy utf8, i.e. containing something that looks like the multi-byte sequence for a latin-1 character but perl hasn't been instructed to treat as such... This test fails when... The string contains utf8 byte sequences and the string hasn't been flagged as utf8 (this normally means that you got it from an external source like a C library;
    Apparently it tests whether the string was properly decoded... (I'm not familiar with it). I guess you need to Encode::decode_utf8 it before feeding the string to the second is_sane_utf8 (Test::utf8 has an example, with Encode::_utf8_on).
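
    Something along these lines, presumably (a sketch using Encode's public decode_utf8 rather than the _utf8_on hack):

        use utf8;
        use Test::More;
        use Test::utf8;
        use JSON::XS;
        use Encode qw(decode_utf8);

        my $json  = encode_json( { value => 'Deliver «French Bread»' } );
        my $chars = decode_utf8($json);    # undo encode_json's UTF-8 encoding

        is_sane_utf8 $chars, 'decoded JSON is a sane character string';   # passes
        done_testing;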

      Ha! What would you say now, ikegami?

      That Ovid used a function without reading what it does first. My exact words are here.

      Ha! What would you say now, ikegami? When people like James Keenan and Ovid don't understand how this stuff works... can less experienced programmers even hope to ever get this right?

      So Ovid got confused about the basics when dealing with some Test:: extras, so what? It's ok to get confused.

        It's not the basics; that is the problem. I'm certainly NOT blaming people for becoming confused... It's Perl's problem (ikegami disagrees).

        Looking at the source of the test in question: is_sane_utf8 tests whether the string was improperly 'upgraded' (the so-called 'double encoding')... rejecting the JSON is more or less a side effect. Quickly, tell me: what does that actually mean?
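
        For concreteness, a small sketch of the 'double encoding' accident that is_sane_utf8 hunts for (my illustration, not from this thread):

          use strict;
          use warnings;
          use Encode qw(encode decode);

          my $chars = "\x{AB}French Bread\x{BB}";   # « ... » as characters
          my $once  = encode( 'UTF-8', $chars );    # correct: « becomes octets C2 AB
          my $twice = encode( 'UTF-8', decode( 'ISO-8859-1', $once ) );
                                                    # the accident: C2 AB becomes C3 82 C2 AB

          printf "once:  %vX\n", $once;    # ...C2.AB...       already "dodgy" to is_sane_utf8
          printf "twice: %vX\n", $twice;   # ...C3.82.C2.AB... the classic mojibake signature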
