PerlMonks  

[not perl] unicode/utf8 in browsers and OS's - where does conversion happen?

by danmcb (Monk)
on Jan 05, 2008 at 18:52 UTC ( [id://660558]=perlquestion )

danmcb has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

Please forgive (mea culpa) that this is not really a perl question. But I thought it was an interesting question, and should concern anyone involved with web programming, in any form. And there are smart people here who might even know the answer.

The question is this: we all know that unicode should generally get transmitted over the wire by the browser as UTF8 (provided that the form is set up correctly and so on). But what happens when javascript grabs that input and does some AJAX tomfoolery with it? Should the javascript see the input already converted to UTF8, or unicode? Or is the answer "not defined"?

And even more tricky - when a user enters data into a form by some unknown method (they could be using MS regional options to specify the keyboard type, for instance Turkish Q, or they might use special software to input Devanagari, hopefully as UTF8, or they may just cut and paste from god knows where) - what should/does the OS (almost always Windows) do? Convert into UTF8 because the form wants it? Just cut/paste and let the browser sort it out?

This seems fraught with issues because you cannot really tell that a string is a UTF8 string just by looking at it. (You *might* be able to tell that it is *not* one ...)

If anyone can assist my poor addled brain, which really shouldn't be dealing with this right after 2 days of flu, I will become most eternally unjustifiable about it all. Thanks.

Replies are listed 'Best First'.
Re: [not perl] unicode/utf8 in browsers and OS's - where does conversion happen?
by ikegami (Patriarch) on Jan 05, 2008 at 22:23 UTC

    But what happens when javascript grabs that input

    Could you be more precise? A function that reads from the socket should return bytes. A function that returns the text of an XML node should return decoded chars. It all depends on the interface.

    It's a question that's easily answered by trying.

    when a user enters data into a form

    Just a quicky to hold you until someone comes along with more info...

    This is a very rough area. IIRC, the agent is supposed to encode the data using the same encoding as the page on which the form resides. However, I remember having a discussion about how it isn't well supported and/or there are issues with the approach.

    Of particular interest is that some browsers (possibly all the major ones) will populate a specific field with the encoding when the field is provided in the form. I can't remember what the field is called.

      what I mean by the javascript thing is:

      var data = myform.mytextarea.value;

      i.e. just grabbing the data out of the textarea directly.

        The javascript should see the characters, and should not see the bytes which represent those characters under utf-8, utf-16, iso-8859-1, ascii, windows-1252 or any other character encoding.
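
        A minimal sketch of that distinction, assuming a runtime with the standard TextEncoder (current browsers, Node.js). The sample string is a made-up stand-in for whatever myform.mytextarea.value would return:

```javascript
// Hypothetical stand-in for the text read from the textarea.
const data = "\u00e7ay"; // "çay": one non-ASCII character plus two ASCII ones

// The script sees characters, three of them here.
const charCount = [...data].length;

// Bytes only exist once an encoding is chosen explicitly.
// TextEncoder always produces UTF-8.
const utf8Bytes = new TextEncoder().encode(data);

console.log(charCount);             // 3
console.log(Array.from(utf8Bytes)); // [ 195, 167, 97, 121 ]
```

        Three characters become four bytes, because the ç takes two bytes in UTF-8. Nothing in the string itself says "utf-8" until the encoder is applied.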
Re: [not perl] unicode/utf8 in browsers and OS's - where does conversion happen?
by kirillm (Friar) on Jan 05, 2008 at 22:42 UTC

    This is a very good question and I'm looking forward to all good responses. I'd like to add my two cents though.

    Personally I just generate the page in utf-8, put into the content-type header that charset is utf-8 and add the accept-charset attribute to the form tag and hope that the client does the right thing:

    <form [...] accept-charset="utf-8"> [...] </form>

    This seems to work fine and the content from the clients does come in utf-8.

      right. I have been doing that too. But if the submission happens via AJAX rather than the normal browser form submission, does the javascript still see utf8? Always?

      A little googling suggests that support for accept-charset is patchy across browsers, so I have been wondering about relying on this. Haven't done any hard tests though.
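
      For the AJAX case, one thing that is fixed by the language itself: encodeURIComponent percent-encodes the UTF-8 bytes of each character, whatever encoding the page was served in. So a request body assembled by hand in script should come out as UTF-8. A rough sketch, where buildFormBody is a made-up helper and not part of any library:

```javascript
// Hypothetical helper: build a URL-encoded request body by hand, as a script
// might before handing it to XMLHttpRequest.send().
function buildFormBody(fields) {
  return Object.entries(fields)
    .map(([name, value]) => encodeURIComponent(name) + "=" + encodeURIComponent(value))
    .join("&");
}

const body = buildFormBody({ comment: "\u00e7ay" }); // "çay"
console.log(body); // comment=%C3%A7ay
```

      The %C3%A7 is the two-byte UTF-8 form of ç. The server still has to be told (or has to assume) that the body is UTF-8 when it decodes it.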

Re: [not perl] unicode/utf8 in browsers and OS's - where does conversion happen?
by graff (Chancellor) on Jan 06, 2008 at 04:59 UTC

    I tried a perl one-liner to produce some "typical" wide characters:

    perl -e 'print join(" ",map {chr()} (0xe0 .. 0xe8)),"\n"'

    I ran that in a variety of different terminal environments, using different (single-byte) character sets, and pasted each of the outputs into the PM text box to create this post. The first output line comes from an iso-8859-1 xterm (i.e. Latin-1):

    à á â ã ä å æ ç è

    The next was pasted from a utf8 macosx Terminal window (I had to add the "-CS" option on the perl command line, just for this one run, to avoid the "Wide character in print" warning):

    à á â ã ä å æ ç è

    Same perl one-liner, in a Terminal using iso-8859-5 (Cyrillic -- no "-CS" option):

    р с т у ф х ц ч ш

    Here's another version of Cyrillic -- koi8:

    Ю А Б Ц Д Е Ф Г Х

    Here's 8859-7 (Greek):

    ΰ α β γ δ ε ζ η θ

    And just to be really perverse, here is 8859-6 (Arabic):

    ـ ف ق ك ل م ن ه و

    My Safari browser's "View->Text Encoding" is set to "Default" (whatever that means), and I was intrigued by the fact that each string of nine characters showed up exactly as intended, appearing exactly the same as in the original terminal that I copied from. Presumably, Safari and macosx are doing some deep magic here, "doing the right thing" with non-ascii character data in accordance with my current terminal setting, but keeping track of everything "under the covers" as unicode characters (otherwise, the Safari text box would not be able to show all those different characters at the same time).

    The results from hitting the "preview" button confirm that the characters are being pasted as unicode code points. Curiously, the text box that comes with the preview page shows the non-latin1 strings as numeric character entities (but since I did not put the strings into <code> tags, these entities show up as the intended characters in the main page display). No telling what might happen with Firefox or IE, or whether the behavior of other browsers might depend on your choice of OS. I'll leave that as an exercise... ;)
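
    The terminal experiment can be approximated in script: take the same nine byte values the one-liner above prints and decode them under each single-byte charset. This is only a sketch. It assumes a runtime whose TextDecoder knows these labels (browsers do; Node.js needs full ICU data), and decodeAs is a made-up helper name:

```javascript
// The nine byte values printed by the one-liner above.
const bytes = Uint8Array.from({ length: 9 }, (_, i) => 0xe0 + i);

// Decode the same bytes under a given charset label.
// Returns null when the runtime does not support that label.
function decodeAs(label, input) {
  try {
    return Array.from(new TextDecoder(label).decode(input)).join(" ");
  } catch (err) {
    if (err instanceof RangeError) return null; // unsupported charset label
    throw err;
  }
}

for (const label of ["iso-8859-1", "iso-8859-5", "koi8-r", "iso-8859-7", "iso-8859-6"]) {
  console.log(label.padEnd(10), decodeAs(label, bytes));
}
```

    Each line of output is one "terminal setting" applied to identical bytes, and once decoded every result is an ordinary unicode string, which is why they can all sit in one text box together.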

    you cannot really tell that a string is a UTF8 string just by looking at it. (You *might* be able to tell that it is *not* one ...)

    Actually, when it's a question of recognizing utf8 vs. just about anything else, it's not at all hard to determine with confidence that "it's definitely utf8" or "it's definitely not utf8". Encode::Guess is good for making this distinction, and it would also do quite well (in most cases) on UTF-16 (BE or LE). There are numeric properties of utf8 that are quite distinctive (very unlikely to occur in other types of data), and UTF16 is a safe bet when you see a regular pattern of null bytes next to 0x0A (line-feed) bytes.
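
    The "definitely not utf8" half of that test is easy to sketch with a strict decoder. This is not Encode::Guess, just the standard TextDecoder with its fatal option switched on (assuming a runtime that supports it), and isValidUtf8 is a made-up name:

```javascript
// Returns true when the bytes form well-formed UTF-8, false otherwise.
function isValidUtf8(input) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(input);
    return true;
  } catch (err) {
    return false; // a fatal decoder throws on any malformed sequence
  }
}

const utf8Sample = new TextEncoder().encode("\u00e0 \u00e1 \u00e2"); // "à á â" as UTF-8
const latin1Sample = Uint8Array.of(0xe0, 0x20, 0xe1, 0x20, 0xe2);     // same text as Latin-1

console.log(isValidUtf8(utf8Sample));   // true
console.log(isValidUtf8(latin1Sample)); // false
```

    The Latin-1 bytes fail because 0xE0 announces a three-byte sequence in utf8 and the space that follows is not a continuation byte. That is the kind of numeric property that rarely holds by accident in other single-byte data.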

    If you have data of indeterminate origin that is clearly not UTF8 or UTF16, and you don't have any external knowledge to give you clues, then it gets a lot harder to figure out what sort of text data you're dealing with -- it can be done, if you have enough known data for each likely language/encoding combination to build good statistical models (probabilities of byte values or byte ngrams), and enough observable "unknown" data in a given language/encoding for Bayesian arithmetic to be reliable.

      "There are numeric properties of utf8 that are quite distinctive (very unlikely to occur in other types of data) ..."

      Well, yes, you can usually say that if something decodes OK as utf8, it probably *is* utf8. But it *will* also be a valid chunk of extended ASCII, or any other charset that makes use of all 256 possibilities for each octet (not elegantly put, but I hope you see my point).
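
      To make the point concrete, here is a small sketch (the variable names are only illustrative). It takes the utf8 bytes for one character and reads them as ISO-8859-1 instead, which never fails because every byte value maps to some character:

```javascript
// UTF-8 bytes for "é" (U+00E9): 0xC3 0xA9.
const eAcuteUtf8 = new TextEncoder().encode("\u00e9");

// ISO-8859-1 maps each byte value 0x00..0xFF to the code point with the same
// number, so String.fromCharCode on the raw bytes is a Latin-1 decode.
const asLatin1 = String.fromCharCode(...eAcuteUtf8);

console.log(asLatin1); // Ã©
```

      Both readings are "valid"; only one of them is what the author meant.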

      And probably is not the same as *is*. Is it really a problem in practice? I'm not sure. Maybe not. Hey, I'm just asking, OK? I like imagining things all going wrong - it's my job ... ;-)
