PerlMonks  

Re^5: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255

by oiskuu (Hermit)
on Dec 10, 2014 at 20:35 UTC ( [id://1109950]=note )


in reply to Re^4: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
in thread JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255

Or you could standardize the internal representation. A string is a sequence of code points. Storing the code point count alongside the byte length could be handy when dealing predominantly with string objects. Then the following cases arise:

  • (nbytes == 0 && ncodepts == 0) trivial case/empty/false
  • (nbytes > 0 && ncodepts == 0) binary blob
  • (nbytes > 0 && ncodepts == nbytes) with a UTF-8 internal rep, this means the string is plain ASCII
  • (nbytes > 0 && ncodepts < nbytes) generic unicode string

Extended 8-bit charsets (ISO 8859) fare poorly with a UTF-8 internal representation, unless you hack the (ncodepts == nbytes) case to indicate the native format...
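
To make the cases above concrete, here is a minimal sketch. The helpers classify and measure_text are hypothetical names made up for illustration; this is not how perl stores scalars today, just the proposed (nbytes, ncodepts) bookkeeping, assuming a UTF-8 internal rep.

```perl
use strict;
use warnings;

# Hypothetical helper: map an (nbytes, ncodepts) pair to one of the
# cases listed above. Illustration only, not perl's real internals.
sub classify {
    my ($nbytes, $ncodepts) = @_;
    return 'empty'   if $nbytes == 0 && $ncodepts == 0;
    return 'blob'    if $nbytes >  0 && $ncodepts == 0;
    return 'ascii'   if $nbytes >  0 && $ncodepts == $nbytes;
    return 'unicode' if $nbytes >  0 && $ncodepts <  $nbytes;
    return 'invalid';    # ncodepts > nbytes cannot happen with UTF-8
}

# Hypothetical helper: compute the pair for a decoded text string,
# assuming the text would be stored as UTF-8.
sub measure_text {
    my ($text) = @_;
    my $bytes = $text;
    utf8::encode($bytes);    # in place: characters -> UTF-8 octets
    return (length $bytes, length $text);
}

print classify(measure_text("abc")), "\n";          # 3 bytes, 3 code points
print classify(measure_text("caf\x{e9}")), "\n";    # 5 bytes, 4 code points
print classify(3, 0), "\n";                         # 3 bytes, no code points
```

Note that the ISO 8859 hack mentioned above would overload the 'ascii' branch, so a real design would need some other way to tell the two apart.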

More interesting is the interaction between objects. Considering a blob and a string object:

$foo = ($str . $obj);
$bar = ($obj . $str);
$baz = "${obj}${str}";

When is the blob promoted to a string, when does the opposite happen? Object representation and efficiency are certainly big concerns, but surely the semantic implications of Unicode are far more insidious.
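
For comparison, a small sketch of what perl does today in the same situation: there is no blob type, so when a byte string is concatenated with a string holding code points above 255, each byte is treated as a Latin-1 character and the result is a character string. The variable names here are illustrative only.

```perl
use strict;
use warnings;

my $blob = "\xC3\xA9";    # two bytes (happen to be the UTF-8 encoding of U+00E9)
my $str  = "\x{263A}";    # one code point above 255, so stored upgraded

my $foo = $str . $blob;
my $bar = $blob . $str;

# Both results are 3 characters long: the two bytes became the
# characters U+00C3 and U+00A9, not the single character U+00E9.
printf "foo: length=%d is_utf8=%d\n", length $foo, utf8::is_utf8($foo) ? 1 : 0;
printf "bar: length=%d is_utf8=%d\n", length $bar, utf8::is_utf8($bar) ? 1 : 0;
printf "second char of foo: U+%04X\n", ord substr($foo, 1, 1);
```

So in current perl the "blob" always loses: it is silently promoted, in either operand order, and nothing records that it was binary to begin with.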


Replies are listed 'Best First'.
Re^6: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
by ikegami (Patriarch) on Dec 13, 2014 at 02:12 UTC

    Basically, you're suggesting changing the UTF8 flag to become a semantic indicator of a "decoded" string (along with the other changes necessary to make that happen). That might be possible, but it might be nicer if we could distinguish "binary (unknown)" from "binary (locale-encoded text)". But then again, the Windows API uses three encodings ("UNICODE", "ANSI" and "OEM").
