PerlMonks  

character encoding & french accents

by dstamos (Initiate)
on Feb 02, 2006 at 21:10 UTC ( #527439=perlquestion )

dstamos has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I hope that someone can help me with this. I am not a Perl programmer but rather the owner of a purchased Perl (CGI) script.

I am working with the programmer on this but we are having a hard time figuring out where the problem is exactly (local, cgi script or smtp).

The application is a mailing list script that connects to an SMTP server to send the email. The emails are written in French with accents. When we receive the email, it is garbled in our email program and webmail. Changing the character encoding in our browser or GUI to UTF-8 makes the accents readable. The standard in browsers and GUIs would seem to be iso-8859-1.

This script writes the email to a text file first and then sends it. It's running on a Fedora Core 3 server with Apache 2.0. Is the problem with the way the file is encoded, or is it a local problem on the server?

Something I noticed is that when I open the text file in vi on the server, the French accents are garbled.

Does anyone have experience with this?

regards

Denny

Replies are listed 'Best First'.
Re: character encoding & french accents
by rhesa (Vicar) on Feb 02, 2006 at 21:45 UTC
    I have plenty of experience with mismatched character encodings, and I agree that they can be hard to track down.

    First of all, where does your source document come from? Which factors determine its encoding?

    Ultimately the best course of action is to ensure that you're using utf8 everywhere (and advertising that to the viewing programs). You can tell the web browser that your content is in utf8 by sending an additional "charset" header attribute like this:

    use CGI;
    my $cgi = CGI->new;
    ...
    print $cgi->header(-type=>'text/html', -charset=>'utf-8');
    You can do the same for emails, by giving them the appropriate MIME headers:
    Content-Type: text/plain; charset="utf-8"
    How you do that depends on how your email-sending code is designed.
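A minimal sketch of what that could look like, assuming the script assembles the message by hand before passing it to the SMTP server. The helper name `build_utf8_message`, the addresses, and the sample text are hypothetical; only the core Encode module is used. The `8bit` transfer encoding assumes the mail server accepts 8-bit content; quoted-printable or base64 would be the more conservative choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: takes character strings and returns the complete
# message as a byte string with headers that declare UTF-8.
sub build_utf8_message {
    my (%args) = @_;

    my @headers = (
        "From: $args{from}",
        "To: $args{to}",
        # Non-ASCII header text has to be wrapped as a MIME encoded-word.
        'Subject: ' . encode('MIME-Header', $args{subject}),
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset="utf-8"',
        'Content-Transfer-Encoding: 8bit',
    );

    # A blank line separates headers from body; the body is encoded once.
    return join("\r\n", @headers) . "\r\n\r\n" . encode('UTF-8', $args{body});
}

my $message = build_utf8_message(
    from    => 'list@example.com',
    to      => 'reader@example.com',
    subject => "Lettre d'information de f\x{e9}vrier",
    body    => "Bonjour, voici les nouveaut\x{e9}s de l'\x{e9}t\x{e9}.\r\n",
);

print $message;
```

The point of the sketch is that the declared charset and the actual bytes of the body agree. A header that says utf-8 on top of a Latin-1 body, or the reverse, produces exactly the kind of garbling described in the question.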

    I'd like to point out that in recent perls (version 5.8 and later), under most circumstances, strings are encoded internally as utf8. So it makes sense to be consistent about that in the rest of your application.

      Thank you to rhesa, graff & fraktalisman for your constructive input.

      "First of all, where does your source document come from? Which factors determine its encoding?"

      The content comes from an input box (form), and the CGI program writes this to a text file. What happens after that, I don't know. I was able to modify the script to insert a MIME header as you suggested, but it did not change anything. I don't know what determines its encoding. I think I'm going to have to get the programmer involved here and come back with some code and more info. Thank you again to everyone.
        I think getting your developer in here to discuss the details is a very good idea.

        I've found on several occasions that it's necessary to manually upgrade form input to utf8. For some reason, CGI returns raw byte strings, and those might end up being upgraded to utf8 once more, resulting in lots of squiggly characters. I did this like so:

        use Encode;
        my $email_body = $cgi->param( 'email_body' );
        $email_body = decode_utf8( $email_body );
        After this, most conversions and display issues are a snap.
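The double-encoding effect described above can be reproduced in a few lines. This is an illustrative sketch with a hard-coded sample value standing in for the form field, not code from the script in question.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# What a form field might deliver for the French word for "summer":
# raw UTF-8 bytes, where each accented e arrives as the pair C3 A9.
my $raw = "\xC3\xA9t\xC3\xA9";

# Mistake: treat the bytes as characters and encode them again on output.
my $double_encoded = encode_utf8($raw);

# Fix: decode once on input, encode once on output.
my $text   = decode_utf8($raw);
my $output = encode_utf8($text);

printf "raw bytes:      %d\n", length $raw;             # 5
printf "characters:     %d\n", length $text;            # 3
printf "double-encoded: %d\n", length $double_encoded;  # 9
printf "correct output: %d\n", length $output;          # 5
```

In the double-encoded case each accented letter has grown from two bytes to four, which a UTF-8 viewer shows as two odd characters per accent instead of one letter.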
Re: character encoding & french accents
by graff (Chancellor) on Feb 03, 2006 at 02:14 UTC
    rhesa's advice is basically correct, but your point about iso-8859-1 being the "default" encoding for many apps is also relevant. It's true that more and more apps (esp. browsers and browser-based email clients) are making it easier to change the character encoding at display time as needed, but it's also true that a lot of them (and the people who use them) still treat the legacy iso-8859 and cp12.. as the "default".

    If you would prefer your email output to be iso-8859-1, use the Encode module on the text data, or PerlIO control on the output file handle, to ensure that the text is written with that encoding. Assuming the email text is going out via a file handle, the easiest thing would probably be:

    binmode $fh, ":encoding(iso-8859-1)"; # convert internal utf8 to latin1 on output
    If you do this, you should still heed rhesa's advice, and explicitly declare what character set you're using in the MIME header.
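To see what the encoding layer does to the bytes, here is a small sketch that writes the same text through two different layers. The helper `write_with_layer` is hypothetical, and an in-memory file handle stands in for the script's real output file.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through an encoding layer into an
# in-memory file and return the resulting bytes.
sub write_with_layer {
    my ($text, $encoding) = @_;

    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory file: $!";
    binmode $fh, ":encoding($encoding)";
    print {$fh} $text;
    close $fh or die "close failed: $!";

    return $buffer;
}

# A character string containing two accented letters.
my $text = "Bonjour, voici l'\x{e9}t\x{e9}.\n";

my $latin1 = write_with_layer($text, 'iso-8859-1');
my $utf8   = write_with_layer($text, 'UTF-8');

printf "latin1: %d bytes\n", length $latin1;
printf "utf8:   %d bytes\n", length $utf8;
```

The UTF-8 output is two bytes longer, one extra byte per accented letter. Note that iso-8859-1 cannot represent every character used in French text: œ and the euro sign, for example, are not part of it, so text containing them will not survive this layer intact.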
Re: character encoding & french accents
by kutsu (Priest) on Feb 02, 2006 at 21:45 UTC

    You'll need to set the charset to Unicode (you can do this with a meta tag in the head: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">). By the way, you can find more info at unicode.org (check out the FAQs).

Re: character encoding & french accents
by fraktalisman (Hermit) on Feb 03, 2006 at 13:38 UTC

    I've had similar problems recently when upgrading an existing content management system from Latin-1 to UTF-8. After extensive testing we have decided to display all website content in UTF-8 encoding, but send emails in Latin-1, because many freemail websites seem to ignore the charset directives.

    If your provider happens to run an older Perl version, the Encode module might not support UTF-8 or it might not be there at all. In that case, have a look at the sub latin1 in the following example: Converting character encodings.
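For illustration only, here is a rough sketch of how such a conversion can be done without the Encode module. It is not the sub latin1 from the linked example; the helper name `utf8_bytes_to_latin1` is made up, and the sketch assumes the input is a well-formed UTF-8 byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a UTF-8 byte string to Latin-1 bytes.
# Characters with no Latin-1 equivalent are replaced by "?".
sub utf8_bytes_to_latin1 {
    my ($bytes) = @_;

    # A single pass, so bytes produced by one replacement are never
    # rescanned as if they were UTF-8 input.
    $bytes =~ s{
        ([\xC2\xC3])([\x80-\xBF])    # two-byte sequences: U+0080..U+00FF
      | [\xC4-\xF4][\x80-\xBF]+      # everything beyond Latin-1
    }{
        defined $1
            ? chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))
            : '?'
    }gex;

    return $bytes;
}

# The two accented letters each shrink from two bytes to one.
my $latin1 = utf8_bytes_to_latin1("\xC3\xA9t\xC3\xA9");
printf "%d bytes\n", length $latin1;    # 3

# The euro sign (E2 82 AC in UTF-8) has no Latin-1 code point.
print utf8_bytes_to_latin1("5 \xE2\x82\xAC"), "\n";    # 5 ?
```

Where the Encode module is available, it remains the safer option, since it also handles malformed input.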
