in reply to CGI.pm and encoding HTML entities in param()

This is probably not CGI's doing but your browser's. For instance, if the charset of your pages is iso-8859-1 and you enter a non-latin1 character (like ā) into a text field, Firefox will submit the character as a numeric character reference (&#257;). This is essentially the best it can do, since there is no way to represent a non-latin1 character in the latin1 encoding. This situation is explained well in the following article:

Character Conversions from Browser to Database
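
To see the same substitution from the Perl side, here is a minimal sketch that imitates what the browser does when it has to squeeze text into latin1. The sub name encode_like_browser is made up for illustration; the fallback behaviour comes from Encode's FB_HTMLCREF check value, and I am assuming the browser's output matches it (decimal references) for this case.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string as latin1 bytes, replacing
# anything latin1 cannot hold with a decimal numeric character reference.
# LEAVE_SRC keeps Encode from modifying the caller's string in place.
sub encode_like_browser {
    my ($chars) = @_;
    return encode('iso-8859-1', $chars,
        Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());
}

# "cafe" with e-acute (U+00E9, in latin1) followed by a-macron (U+0101, not in latin1)
my $typed = "caf\x{e9} \x{101}";
my $sent  = encode_like_browser($typed);

# $sent now holds the bytes "caf\xE9 &#257;": the e-acute went through as a
# single latin1 byte, the a-macron became an entity.
```

So the entity you see in param() was already in the request body before CGI.pm ever looked at it.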

As for CGI, the values returned by param() are byte strings, not code-point strings. Because of the way the web standards evolved, there just isn't enough information in the request to convert the parameter values to code points. So this is something your application has to do, based on what it knows about the encoding of the forms and web pages that will be calling it.
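
A rough sketch of doing that conversion yourself, assuming you know which charset the form's page was served in. The helper names decode_param and expand_numeric_refs are hypothetical, not part of CGI.pm; only Encode::decode is a real library call here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes of one parameter value into a
# character string, given the charset the calling page was served in.
sub decode_param {
    my ($bytes, $charset) = @_;
    return unless defined $bytes;
    # FB_CROAK makes malformed input die instead of being silently replaced.
    return decode($charset, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Optional and lossy: expand the decimal references a browser substituted on
# a latin1 form. It cannot tell them apart from text the user literally
# typed as "&#257;", so only use it if that ambiguity is acceptable.
sub expand_numeric_refs {
    my ($chars) = @_;
    $chars =~ s/&#(\d+);/chr($1)/ge;
    return $chars;
}

# What a UTF-8 form sends for U+0101, and what a latin1 form sends instead.
my $from_utf8_form   = decode_param("\xC4\x81", 'UTF-8');
my $from_latin1_form = expand_numeric_refs(decode_param("&#257;", 'iso-8859-1'));
```

In a real script the bytes would come from something like scalar $q->param('name'), where $q is your CGI object.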

This thread sheds some additional light on the problem: CGI::Application - Which is the proper way of handling and outputting utf8. As Juerd notes, it would be helpful if CGI were (or could be made) encoding-aware so that parameter values could automatically be passed through a decoding function.

A good way to help avoid character encoding problems is to 1) always explicitly specify the charset of your pages, and 2) settle on one encoding that can handle everything, e.g. UTF-8.
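
As an illustration of both points, here is a minimal sketch of a response that states UTF-8 in the HTTP header, in the markup, and on the form. The sub name utf8_response and the page contents are invented for the example, and it assumes the body text passed in has already been HTML-escaped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a complete CGI response as a byte string.
# $body_chars is a character string that is assumed to be HTML-escaped.
sub utf8_response {
    my ($body_chars) = @_;
    my $html = <<"HTML";
<!DOCTYPE html>
<html>
<head><meta charset="UTF-8"><title>Example</title></head>
<body>
<form method="post" accept-charset="UTF-8">
<input type="text" name="name">
</form>
<p>$body_chars</p>
</body>
</html>
HTML
    # The header names the charset; the body is encoded to match it.
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . encode('UTF-8', $html);
}

my $response = utf8_response("\x{101}");

# In a CGI script you would then do:
#   binmode STDOUT;      # write the bytes as-is
#   print $response;
```

With the page declared as UTF-8, a browser has no reason to fall back to entities, and the bytes that come back can be decoded as UTF-8.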

Re^2: CGI.pm and encoding HTML entities in param()
by skazat (Chaplain) on Jun 23, 2008 at 03:47 UTC
    My goodness, I think you're right. Thanks for such an eloquent reply! Charsets and encoding aren't my favorite gremlins to attempt to solve, especially in 8+ year old code! - s