PerlMonks  

Re^3: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255

by Anonymous Monk
on Dec 07, 2014 at 04:36 UTC ( [id://1109463]=note )


in reply to Re^2: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
in thread JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255

It's not the basics, this is the problem. I'm certainly NOT blaming people for becoming confused... It's Perl's problem (ikegami disagrees).

Looking at the source of the test in question, is_sane_utf8 tests whether the string was improperly 'upgraded' (the so-called 'double encoding')... rejecting the JSON is more or less a side effect. Quickly, tell me, what does that actually mean?


Replies are listed 'Best First'.
Re^4: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
by ikegami (Patriarch) on Dec 07, 2014 at 05:27 UTC

    Damn right I disagree. It is not Perl's problem that someone is using a function that's documented to check for accidental double-encoding to check whether something is valid UTF-8. That's akin to using uc to get the first character of a string. There's nothing Perl can do to stop you from using a function completely unrelated to the one you want to use.

    This is the second time in this thread you've implied that I maintain that Perl's handling of UTF-8 isn't confusing. That's a lie. The former bugs in Perl (some still present) and the plethora of buggy XS modules (because XS is hard!) have led people like you to disseminate misinformation, which has created a self-feeding vicious loop of confused people. I've repeatedly said that Perl should be able to differentiate encoded strings from decoded strings and prevent you from mixing them.

    Speaking of misinformation, improper upgrading doesn't cause double-encoding. Quite the opposite, it causes a string encoded using UTF-8 to become decoded. (Upgrading a string that isn't encoded using UTF-8 creates a corrupt scalar, as seen using perl -MDevel::Peek -MEncode=_utf8_on -we"$_ = qq{\x80}; _utf8_on($_); Dump($_)")
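A minimal sketch of the first half of that point, assuming Encode's encode_utf8 and _utf8_on (the variable names are only illustrative). Switching the flag on over bytes that happen to be valid UTF-8 leaves the buffer untouched, but Perl then reads it as the decoded character:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 _utf8_on);

# Start from the UTF-8 encoding of U+00E9: the two bytes C3 A9.
my $s = encode_utf8("\x{E9}");
my $len_before = length($s);     # counts bytes at this point

# Flip the internal flag without touching the buffer.
# For demonstration only; not something to do in real code.
_utf8_on($s);
my $len_after = length($s);      # now counts characters

printf "before: %d, after: %d, code point: U+%04X\n",
    $len_before, $len_after, ord($s);
# prints "before: 2, after: 1, code point: U+00E9"
```

The string went from encoded to decoded, which is the opposite of double encoding.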

    Quickly, tell me, what does that actually mean?

    Double encoding is doing encode_utf8(encode_utf8($x)) when you mean to do encode_utf8($x).
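As a small illustration of that definition, here is a sketch assuming Encode's encode_utf8 and decode_utf8; hex_bytes is a hypothetical helper added only to make the bytes visible:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: show a string as space-separated hex values.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

my $x     = "\x{E9}";             # one character, U+00E9
my $once  = encode_utf8($x);      # what was meant
my $twice = encode_utf8($once);   # the accident: bytes treated as characters

print hex_bytes($once),  "\n";    # C3 A9
print hex_bytes($twice), "\n";    # C3 83 C2 A9

# A single decode undoes only one of the two encodes.
my $still_encoded = decode_utf8($twice);
print hex_bytes($still_encoded), "\n";   # C3 A9
```

The second encode_utf8 treats each byte of the first result as a character in its own right, so the receiver who decodes once still ends up with encoded data.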

      Damn right I disagree. It is not Perl's problem that someone is using a function that's documented to check for accidental double-encoding to check whether something is valid UTF-8. That's akin to using uc to get the first character of a string. There's nothing Perl can do to stop you from using a function completely unrelated to the one you want to use.

      Except it's kind of hard to understand what the heck the function is doing. 'flagged as utf8', 'store a string internally'... too many implementation details. Do you expect many people to understand it?

      This is the second time in this thread you've implied that I maintain that Perl's handling of UTF-8 isn't confusing. That's a lie. The former bugs in Perl (some still present) and the plethora of buggy XS modules (because XS is hard!) have led people like you to disseminate misinformation, which has created a self-feeding vicious loop of confused people. I've repeatedly said that Perl should be able to differentiate encoded strings from decoded strings and prevent you from mixing them.

      Maybe you missed that, ikegami... but I actually never have any problems with mojibake in my Perl code... unlike some other people. I know where these kinds of bugs come from and how to fix them. Works for me, eh?

      Speaking of misinformation, improper upgrading doesn't cause double-encoding. Quite the opposite, it causes a string encoded using UTF-8 to become decoded. (Upgrading a string that isn't encoded using UTF-8 creates a corrupt scalar. perl -MDevel::Peek -MEncode=_utf8_on -we"$_ = qq{\x80}; _utf8_on($_); Dump($_)")

      I've called it 'upgrading' (in quotes) in honor of utf8::upgrade (perl -MDevel::Peek -CO -E 'my $s = "\xff"; Dump $s; say $s; utf8::upgrade($s); Dump $s; say $s' - note that I don't care one bit how Perl actually does that). Not sure why you even mentioned _utf8_on. Anyway, I really dislike this term 'double encoding', because that implies that the problem is with the encoding, and not the decoding part (encoding needs some decoding first). Why isn't double encoding utf-8 a no-op? Really, just explain it in your own words.

      (perl -MEncode=encode -E 'say encode("Latin-1", encode("Latin-1", "\xff"))' doesn't seem to do much of anything?)
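One way to see why the Latin-1 one-liner looks like a no-op while the UTF-8 equivalent is not, sketched with Encode's encode (the variable names are illustrative, and this covers only code points up to U+00FF):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "\xff";   # one character, U+00FF

# Latin-1 maps U+0000..U+00FF to the byte with the same value,
# so encoding the result again yields the same single byte.
my $latin1_once  = encode("Latin-1", $s);
my $latin1_twice = encode("Latin-1", $latin1_once);

# UTF-8 needs two bytes for U+0080..U+00FF, so a second pass
# re-encodes each of those bytes as a character of its own.
my $utf8_once  = encode("UTF-8", $s);
my $utf8_twice = encode("UTF-8", $utf8_once);

printf "Latin-1: %d then %d byte(s)\n",
    length($latin1_once), length($latin1_twice);   # 1 then 1
printf "UTF-8:   %d then %d byte(s)\n",
    length($utf8_once), length($utf8_twice);       # 2 then 4
```

Encoding twice is harmless only when the encoding happens to map every input value to itself, which Latin-1 does for this range and UTF-8 does only for ASCII.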

        Except it's kind of hard to understand what the heck the function is doing. 'flagged as utf8', 'store a string internally'... too many implementation details.

        This is my very problem with you: You bring up internal details for no reason. And these implementation details just end up confusing people, not helping them.

        Except it's kind of hard to understand what the heck the function is doing

        That a module is badly documented is not Perl's fault.

        Maybe you missed that, ikegami... but I actually never have any problems with mojibake in my Perl code...

        Yeah, I know you know you know better.

        I've called it 'upgrading' (in quotes) in honor of utf8::upgrade

        That doesn't double encode either. That doesn't change the string at all. (Remove the upgrade from your code and you get the same output.)
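A quick way to check that claim without Devel::Peek, as a sketch using only the built-in utf8:: functions (the variable names are illustrative):

```perl
use strict;
use warnings;

my $plain    = "\xff";
my $upgraded = "\xff";
utf8::upgrade($upgraded);   # changes the internal storage only

# As strings, the two are indistinguishable.
print "same string\n" if $plain eq $upgraded;
print "same length\n" if length($plain) == length($upgraded);

# Only the internal flag differs.
printf "flag: %d vs %d\n",
    (utf8::is_utf8($plain)    ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);   # flag: 0 vs 1
```

Both scalars hold the single character U+00FF; only the storage format differs, which ordinary string code should not be able to observe.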

        Not sure why you even mentioned _utf8_on

        _utf8_on and utf8::upgrade both end up with an upgraded string. _utf8_on is the one used throughout the docs for Test::utf8, and your comment was wrong whichever function you were talking about.
