skazat has asked for the wisdom of the Perl Monks concerning the following question:

I'm finding that vars that I pull from CGI.pm's param method are encoded, as if they've been run through HTML::Entities. À la:

my $foo = $q->param('foo');

I couldn't find the entity encoding spec'd in the docs - is this something new? This is becoming a problem, as I'm double-encoding my HTML entities. I've been working with CGI.pm for about 9 years now, and it's surprising to find the behavior change.

I'm having a hard time making a simple example from the large behemoth of a program that's showing this problem, but I will continue to try to get one :)

Replies are listed 'Best First'.
Re: CGI.pm and encoding HTML entities in param()
by pc88mxer (Vicar) on Jun 18, 2008 at 01:29 UTC
    This is probably not CGI's doing but your browser's. For instance, if the charset of your pages is iso-8859-1, and you enter a non-latin1 character (like ā) into a text field, Firefox will represent the character in entity form (&#257;). This is essentially the best it can do since there is no way to represent a non-latin1 character in the latin1 encoding. This situation is explained well in the following article:

    Character Conversions from Browser to Database
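
To make the double-encoding concrete, here is a minimal sketch using only core Perl. The helpers `escape_html` and `decode_numeric_refs` are hypothetical stand-ins written for this illustration (a real program would more likely use HTML::Entities), and the sample value assumes the browser submitted a numeric character reference such as `&#257;`.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;    # must run first so later escapes are not re-escaped
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

# Hypothetical helper: turn decimal numeric character references
# (e.g. "&#257;") back into the characters they stand for.
sub decode_numeric_refs {
    my ($value) = @_;
    $value =~ s/&#(\d+);/chr($1)/ge;
    return $value;
}

# Assumed sample: what a latin1 form may deliver when the user typed "cafā".
my $submitted = 'caf&#257;';

# Escaping the value as-is encodes the ampersand a second time.
print escape_html($submitted), "\n";    # caf&amp;#257;

# Decoding the references first yields a 4-character string instead.
my $decoded = decode_numeric_refs($submitted);
printf "%d characters, last code point U+%04X\n",
    length($decoded), ord(substr($decoded, -1));
```

Note that decoding references this way cannot tell a browser-generated `&#257;` from one the user typed literally, which is one more reason to prefer an encoding that avoids the conversion altogether.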

    As for CGI, the values returned by param() are byte strings, not code-point strings. Due to the way the web standards evolved there just isn't enough information in the request to convert the parameter values to code-points. So this is something your application has to do based on what it knows about the encoding of the forms and web pages that will be calling it.
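
As a rough sketch of that application-side step, the core Encode module can turn the bytes into a character string once the encoding is known. The helper name `decode_param` and the choice of UTF-8 are assumptions for illustration; in a real script the bytes would come from `$q->param('foo')`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the raw bytes of a form value into characters,
# using the charset the application knows its forms are submitted in.
sub decode_param {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

# Assumed sample: the two UTF-8 bytes that encode U+0101 (ā).
my $raw  = "\xC4\x81";
my $text = decode_param($raw, 'UTF-8');

printf "%d byte(s) in, %d character(s) out, code point U+%04X\n",
    length($raw), length($text), ord($text);
```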

    This thread sheds some additional light on the problem: CGI::Application - Which is the proper way of handling and outputting utf8. As Juerd notes, it would be helpful if CGI was (or could be made) encoding aware so that parameter values could automatically be passed through a decoding function.

    A good way to help avoid character encoding problems is to 1) always explicitly specify the charset of your pages, and 2) settle on one encoding that can handle everything, e.g. UTF-8.
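
Both points could look roughly like this in a plain CGI script. This is a sketch that builds the header by hand; `content_type_header` is a hypothetical helper, and the form markup is only an example.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Content-Type header that names the charset explicitly.
sub content_type_header {
    my ($charset) = @_;
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

# Encode everything printed to the client as UTF-8, matching the header.
binmode STDOUT, ':encoding(UTF-8)';

print content_type_header('UTF-8');
# accept-charset asks the browser to submit the form data as UTF-8 too.
print qq{<form method="post" accept-charset="UTF-8"><input name="foo"></form>\n};
```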

      My goodness, I think you're right. Thanks for such an eloquent reply! Charsets and encoding aren't my favorite gremlins to attempt to solve, especially in 8+ year old code! - s
Re: CGI.pm and encoding HTML entities in param()
by almut (Canon) on Jun 17, 2008 at 22:01 UTC

    What does the respective raw query look like when you encounter the problem, i.e. $ENV{QUERY_STRING} with GET requests? In other words, are you sure it's CGI.pm that's responsible for the encoding?
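
One way to check is to look at the raw query string without any CGI module in between. The sketch below uses only core Perl; `url_decode` and `raw_param` are hypothetical helpers, and the fallback query is made-up sample data for running the script outside a web server.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-decode one component of a query string.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Hypothetical helper: return the first value of $name in a raw query string,
# or nothing if the parameter is absent.
sub raw_param {
    my ($query, $name) = @_;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key;
        return url_decode(defined $value ? $value : '')
            if url_decode($key) eq $name;
    }
    return;
}

# Use the real query under a web server, else an assumed sample for testing.
my $query = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : 'foo=caf%26%23257%3B';

print "raw query: $query\n";
my $foo = raw_param($query, 'foo');
print "foo: ", (defined $foo ? $foo : '(not present)'), "\n";
```

If the entity already shows up in the raw query (as `%26%23257%3B` in the sample), it was sent by the browser rather than added on the server side.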

    Also, as you describe things, it sounds as if this is a new phenomenon. So, has there been a version change of CGI.pm lately? Which version is it, anyway?
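
A quick way to answer the version question from the command line is sketched below. `module_version` is a hypothetical helper; it returns undef when the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: load a module by name and report its version,
# or return undef if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return scalar eval { require $file; $module->VERSION };
}

my $version = module_version('CGI');
print defined $version
    ? "CGI.pm version $version\n"
    : "CGI.pm is not installed for this perl\n";
```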