PerlMonks  

UTF-8 from a CGI script

by PerlRudi (Initiate)
on Oct 22, 2005 at 02:41 UTC ( [id://502159]=perlquestion )

PerlRudi has asked for the wisdom of the Perl Monks concerning the following question:

I recently noticed that my Perl CGI script, which reads an XML file and renders some of its contents to the browser as HTML, was not rendering some foreign-language characters correctly. While troubleshooting the issue, I created a simple test page, but it shows the same problem: characters such as the e accent aigu (é) do not appear correctly in the browser.
#!/usr/bin/perl -w
#use strict;
use CGI::Carp "fatalsToBrowser";
use CGI;
use POSIX qw(ceil floor);
use Cwd;
use Template;
use File::Basename;

# Write output immediately
$|=1;

my $query=new CGI;
my $form=new CGI;
use XML::XSLT;

print <<HTML;
Content-Type: text/html; charset=utf-8

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<html>
HTML

use Encode 'is_utf8';

my $u_temp = "Temperature:350\x{00B0}F html:&deg;";
#print 'Is string utf8?[' . utf8::is_utf8($u_temp) . ']<BR>';
print is_utf8($u_temp) ? 1 : 0, "<BR>";
print $u_temp . '<BR>';

my $price_label = "Price:\x{20AC}9.99";
print $price_label . '<BR>';

my $smiley = "Smiley:\x{263a}";
print $smiley . '<BR>';

my $french_word = "Some French Word: Saut\x{00E9} html:&eacute;";
print $french_word . '<BR>';
print '<HR>';
In my browser, the euro symbol, the smiley, and all the characters written as HTML entities appear correctly, but the Unicode degree sign and the accented e do not. The is_utf8() call returns 0. I first suspected the character encoding of the HTML, but my browser does appear to understand that the encoding is UTF-8. Any insight into whether my test should work, or what else I should look at to read a Unicode XML file with the XML DOM and render it to the browser, would be appreciated. I believe everything was working fine, but my hosting provider recently upgraded to version 5.8. Thanks in advance.

Replies are listed 'Best First'.
Re: UTF-8 from a CGI script
by Errto (Vicar) on Oct 22, 2005 at 03:05 UTC
    Try
    binmode STDOUT, ':utf8';
    before your print statements. Update: to see why this matters, before doing that try adding use warnings; and checking your server log. You'll see warnings to the effect of "Wide character in print at...", which means that Perl does not know how to output the Unicode characters because STDOUT is currently using some single-byte encoding (usually ISO-8859-1). So you need to explicitly tell it to use UTF-8 instead.
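A minimal sketch of the effect of the :utf8 layer, assuming in-memory filehandles as a stand-in for STDOUT so the written bytes can be inspected. The variable names are illustrative and not from the original script. It prints the degree sign from the test page once without a layer and once with :utf8.

```perl
use strict;
use warnings;

# Degree sign (U+00B0), as in the original test string.
my $degree = "350\x{00B0}F";

# In-memory filehandles stand in for STDOUT so the output bytes can be examined.
my ($raw_bytes, $utf8_bytes) = ('', '');

{
    # No layer: a code point below 256 is written as a single byte.
    open my $fh, '>', \$raw_bytes or die "open failed: $!";
    print {$fh} $degree;
    close $fh;
}
{
    # With :utf8, each character is encoded as UTF-8 on output.
    open my $fh, '>:utf8', \$utf8_bytes or die "open failed: $!";
    print {$fh} $degree;
    close $fh;
}

# Show the bytes that were actually written, in hex.
my $raw_hex  = join ' ', map { sprintf('%02X', ord($_)) } split //, $raw_bytes;
my $utf8_hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $utf8_bytes;

print "no layer: $raw_hex\n";    # 33 35 30 B0 46
print ":utf8   : $utf8_hex\n";   # 33 35 30 C2 B0 46
```

A lone B0 byte is not valid UTF-8, which would be consistent with a browser that has been told charset=utf-8 failing to show the degree sign, while C2 B0 is the UTF-8 encoding of the same character.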
Re: UTF-8 from a CGI script
by fizbin (Chaplain) on Oct 22, 2005 at 03:24 UTC

    In addition to what was already mentioned, be sure that when you open the XML file you open it in UTF-8 mode too, if you're doing the open yourself. (If you're using an XML-specific module to which you pass a filename or URL, you can probably trust the module to take care of this.) By default it will probably be opened as ISO Latin-1, not UTF-8.
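A sketch of what opening a file "in UTF-8 mode" can look like when doing the open yourself. The temporary file and its one-line content are hypothetical stand-ins for the real XML file. The same bytes are read once without a layer and once with an :encoding(UTF-8) layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the XML file: "Saut\x{E9}" stored as UTF-8 bytes.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;    # write the bytes exactly as given
print {$tmp_fh} "<word>Saut\xC3\xA9</word>\n";
close $tmp_fh or die "close failed: $!";

# Without a layer, every byte becomes one character, so the e-acute arrives as two.
open my $raw, '<', $path or die "Cannot open $path: $!";
my $as_bytes = <$raw>;
close $raw;

# With an encoding layer, the bytes are decoded into characters while reading.
open my $decoded, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $as_chars = <$decoded>;
close $decoded;

chomp($as_bytes, $as_chars);
printf "without layer: %d characters\n", length $as_bytes;   # 19
printf "with layer:    %d characters\n", length $as_chars;   # 18
```

Data read through the decoding layer can then be printed to a handle that has an encoding layer of its own, so that decoding on input and encoding on output stay paired.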

    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: UTF-8 from a CGI script
by ioannis (Abbot) on Oct 22, 2005 at 04:19 UTC
    A few observations regarding your utf-8 issues:

    • The is_utf8() check failed because of Perl's backward-compatibility rules. For hex values of "\x{00FF}" or less, the is_utf8() check is supposed to fail. Here are a few examples using your variables:
          binmode \*STDOUT, ':utf8';
          my $u_temp = "Temperature:350\x{00B0}F html:&deg;";
          my $smiley = "Smiley:\x{263a}";
          my $price_label = "Price:\x{20AC}9.99";
          print 'is degree' if utf8::is_utf8( $u_temp );
          print 'is smiley' if utf8::is_utf8( $smiley );
          print 'is price label' if utf8::is_utf8( $price_label );
    • Others have already posted about binmode() and 'use open' layers.
    • In addition to Perl-related issues, also make sure that your fonts for X (or the Linux console), as well as your browser (or xterm, or cat(1)), are able to display UTF-8 characters. (I use LatCyrGr-16.psf fonts for the Linux console.)
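To illustrate the is_utf8() point, here is a small sketch with illustrative variable names. It assumes only the built-in utf8:: functions. It suggests that the flag describes how Perl stores the string internally rather than what the string contains.

```perl
use strict;
use warnings;

my $degree = "\x{00B0}";   # code point below 256
my $smiley = "\x{263A}";   # code point above 255

my $degree_flag = utf8::is_utf8($degree) ? 1 : 0;
my $smiley_flag = utf8::is_utf8($smiley) ? 1 : 0;

# utf8::upgrade changes the internal representation only, not the characters.
my $upgraded = $degree;
utf8::upgrade($upgraded);
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_string   = ($upgraded eq $degree) ? 1 : 0;

print "degree flag:          $degree_flag\n";     # 0
print "smiley flag:          $smiley_flag\n";     # 1
print "upgraded degree flag: $upgraded_flag\n";   # 1
print "still equal:          $same_string\n";     # 1
```

On this reading, a 0 from is_utf8() does not by itself mean the string is broken. What reaches the browser depends on the layer on the output handle.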

      In addition to ioannis's third point, be aware of the consequences of having UTF8 chars in your script.

      I was trapped by this until tye came to the rescue. :-)

Re: UTF-8 from a CGI script
by Anonymous Monk on Oct 24, 2005 at 13:10 UTC
    Thank you to everyone for the help on the issue. This was great. In my case Errto's pointing out the situation with STDOUT solved the issue but all the information will be helpful as I proceed. You guys rock.
