UTF-8 and browsers - Update

by amonroy (Scribe)
on Feb 14, 2005 at 03:37 UTC (#430672)

amonroy has asked for the wisdom of the Perl Monks concerning the following question:

Sorry if this is off-topic, but I need some advice from people who have worked with UTF-8.

Do you know why Firefox (so far I have tested it under Windows, Linux and MacOS X) does not show an Ö when I send the following data to the browser:

print $cgi->header(-charset => "utf-8"), "\x{4F}\x{CC}\x{88}";

Firefox shows an O followed by the dieresis instead of a "clean" O with dieresis. Other browsers such as IE (under MacOS and WinXP) and Safari handle this correctly.

Update

The sequence \x{4F}\x{CC}\x{88} represents the Unicode LATIN CAPITAL LETTER O followed by a DIAERESIS, which in theory should be displayed just like the Unicode LATIN CAPITAL LETTER O WITH DIAERESIS (\x{C3}\x{96}).

But I found that Firefox might have a bug displaying Unicode canonical equivalents.

Now, I have a less off-topic question: how can I convert the Unicode LATIN CAPITAL LETTER O followed by a DIAERESIS to LATIN CAPITAL LETTER O WITH DIAERESIS? Or, in hexadecimal terms, how can I convert \x{4F}\x{CC}\x{88} to \x{C3}\x{96}?

I tried all the normalization forms in Unicode::Normalize, but had no luck.

Replies are listed 'Best First'.
Re: UTF-8 and browsers - Update
by theorbtwo (Prior) on Feb 14, 2005 at 06:35 UTC

    In general, you shouldn't try to utf8-encode yourself -- instead, let perl do the work of encoding, and just give the abstract codepoints. This is, in fact, what is causing your problem. You specified character 0x4F, followed by character 0x308, that is, "LATIN CAPITAL LETTER O", "COMBINING DIAERESIS". I think you wanted that to be 0xA8, "DIAERESIS".
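A minimal sketch of the difference between writing UTF-8 bytes by hand and giving Perl the abstract codepoints, assuming the core Encode module is available. The variable names are made up for illustration. It also suggests one possible reason normalization appeared to do nothing: as characters, "\x{4F}\x{CC}\x{88}" is U+004F, U+00CC, U+0088, which contains no combining mark to compose.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Written as escapes below 0x100, this is three separate characters
# (U+004F, U+00CC, U+0088), not a two-character string.
my $as_bytes = "\x{4F}\x{CC}\x{88}";

# Two abstract codepoints: LATIN CAPITAL LETTER O, COMBINING DIAERESIS.
my $as_chars = "\x{4F}\x{308}";

# Let Perl produce the UTF-8 bytes from the characters.
my $encoded = encode('UTF-8', $as_chars);

printf "%d %d %d\n", length($as_bytes), length($as_chars), length($encoded);
print $encoded eq $as_bytes ? "same bytes\n" : "different bytes\n";
```

This should print "3 2 3" and then "same bytes": the hand-written escapes are exactly what encode() produces from the two codepoints.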


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

Re: UTF-8 and browsers - Update
by dakkar (Hermit) on Feb 14, 2005 at 12:39 UTC

    Bug in Firefox. It should work as you describe.

    As for the composition: first of all, work on characters, or at least on codepoints, not on utf-8 bytes. Second, you want Unicode Normal Form C (see Unicode::Normalize), so that you can write:

    use Unicode::Normalize;
    use charnames ':full'; # this is just to make things easier in this example
    binmode(STDOUT,':utf8'); # this to make 'print' output utf-8 bytes
    my $a="O\N{COMBINING DIAERESIS}";
    my $b=NFC($a);
    print length($a),$a,"\n";
    print length($b),$b,"\n";

    Will print:

    2Ö
    1Ö

    (more or less, depending on PM's escaping mechanisms)
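For the byte-level conversion asked about in the question (\x{4F}\x{CC}\x{88} to \x{C3}\x{96}), one way to apply the advice above is to decode the bytes into characters, normalize, and encode again. This is only a sketch, assuming the Encode and Unicode::Normalize modules; nfc_utf8_bytes is a hypothetical helper name, not something from either module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: take UTF-8 bytes, return the NFC form as UTF-8 bytes.
sub nfc_utf8_bytes {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> characters
    return encode('UTF-8', NFC($chars));    # compose, then characters -> bytes
}

my $in  = "\x{4F}\x{CC}\x{88}";             # O + COMBINING DIAERESIS, as UTF-8 bytes
my $out = nfc_utf8_bytes($in);

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
```

This should print "C3 96", the UTF-8 encoding of LATIN CAPITAL LETTER O WITH DIAERESIS. Calling NFC directly on $in would leave it unchanged, because without the decode step Perl sees three unrelated characters.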

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)
