Hi All,
After hitting issues with form input that contained non-ASCII characters, such as £, I wrote a QnD script to try and understand what is going on. I'm afraid I still don't fully understand :/
Code for the script is included at the bottom; it'll run on Linux or Windows, Apache/IIS/others.
As far as I understand it:-

The form input is being encoded as UTF-8 by the browser, as the server has set a UTF-8 charset in its headers
When Perl CGI.pm picks it up, it has no idea it's UTF-8
If it gets saved straight out to a file, it'll still be in UTF-8, although the file itself may not be
If decoded with Encode.pm, Perl will flag it as being UTF-8, but convert it to its own internal format
If encoded with Encode.pm, Perl will NOT flag it as being UTF-8; it'll actually be double encoded
If you try to manipulate a UTF-8 string that hasn't been decoded, such as with a regexp, strange things might happen
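
The points above can be sketched without a web server. This is only a minimal illustration, assuming the browser sent the two UTF-8 bytes C2 A3 for a £; the variable names and the `hex_ords` helper are made up for this example. It prints code points as hex so that the terminal's own encoding can't muddy the picture.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( is_utf8 encode decode );

# Assumption: these two bytes stand in for what CGI.pm hands back for a pound sign.
my $input = "\xC2\xA3";

# decode: UTF-8 octets become one Perl character (U+00A3), flagged as UTF-8.
my $string = decode( 'UTF-8', $input );

# encode on the still-undecoded octets: each byte is treated as a separate
# character and encoded again, which is the double encoding.
my $octets = encode( 'UTF-8', $input );

# Hypothetical helper: show each character's ordinal as two hex digits.
sub hex_ords { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

for my $pair ( [ input => $input ], [ decoded => $string ], [ encoded => $octets ] ) {
    my ( $name, $value ) = @$pair;
    printf "%-8s length %d, flagged %-3s, ords %s\n",
        $name, length $value, ( is_utf8($value) ? 'yes' : 'no' ), hex_ords($value);
}
# input    length 2, flagged no , ords C2 A3
# decoded  length 1, flagged yes, ords A3
# encoded  length 4, flagged no , ords C3 82 C2 A3

# A regexp sees two characters in the undecoded string, but one after decoding.
print "undecoded matches /^.\$/: ", ( $input  =~ /^.$/ ? 'yes' : 'no' ), "\n";    # no
print "decoded matches /^.\$/:   ", ( $string =~ /^.$/ ? 'yes' : 'no' ), "\n";    # yes
```
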

Given this, I decided to use HTML::Entities to convert characters such as £ to &pound;. This is where things got more confusing. The output of my test script is:-

Input: £ (IS UTF8? No)
Decoded: ? (IS UTF8? Yes)
Encoded: Â£ (IS UTF8? No)
Entities input: Â£
Entities decoded: £
Entities encoded: Ã‚Â£

If I print the input straight back out, it comes out as a normal £ as expected; decoded, it gets an unrecognised character symbol; encoded, it has the telltale Â appear. But if I pass it through HTML::Entities, the input gets the Â and the decoded one comes out right?? The encoded one, well, that comes out even weirder.
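
The HTML::Entities part can be reproduced in isolation. A minimal sketch, again assuming the raw input is the two bytes C2 A3; the variable names are just for illustration. Without decoding, `encode_entities` treats each byte as its own character, whereas the decoded string is a single £ character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );
use HTML::Entities qw( encode_entities );

# Assumption: undecoded UTF-8 octets for a pound sign, as they arrive from the form.
my $raw     = "\xC2\xA3";
my $decoded = decode( 'UTF-8', $raw );    # one character, U+00A3

# Two separate "characters" (0xC2 and 0xA3), so two entities come back.
my $ent_raw = encode_entities($raw);

# One character (U+00A3), so one entity comes back.
my $ent_decoded = encode_entities($decoded);

print "Entities input:   $ent_raw\n";        # &Acirc;&pound;
print "Entities decoded: $ent_decoded\n";    # &pound;
```

The output is plain ASCII either way, so a browser shows Â£ for the first line and £ for the second regardless of the page's charset.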
On top of this, if you write these out to a file, and view using nano or vi you see:-

Input: Â£
Decoded: Â£
Encoded: Ã‚Â£

Which didn't make sense to me; I expected the decoded one to be just £. But when I tested this script on Win32 IIS, I got:-

Input: Â£
Decoded: £
Encoded: Ã‚Â£

Which is what I expected???

Maybe a UTF-8 expert could explain this? It might make a good reference.

Test script:-

#!/usr/bin/perl
use strict;

BEGIN {
    print "content-type: text/html; charset=UTF-8\n\n";
    use FindBin qw ($RealBin $RealScript);
    use lib $FindBin::RealBin;
    chdir $RealBin;
}#BEGIN

use CGI;
my $cgi = new CGI;

print qq~
<form method=POST>
input: <input type=text name=string value="${ \$cgi->param('string') }">
<input type=submit>
</form>
~;


if ( $cgi->param('string') ) {
    use Encode qw( is_utf8 encode decode );
    print "Input: ${ \$cgi->param('string') } (IS UTF8? ";
    if ( is_utf8($cgi->param('string')) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }
    my $string = decode("utf8", $cgi->param('string'));
    print "Decoded: $string (IS UTF8? ";
    if ( is_utf8($string) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }
    my $octets = encode("utf8", $cgi->param('string'));
    print "Encoded: $octets (IS UTF8? ";
    if ( is_utf8($octets) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    open( OUTF, '>utf8.txt' ) || print("Error writing file");
        print OUTF "Input: ${ \$cgi->param('string') }\n";
        print OUTF "Decoded: $string\n";
        print OUTF "Encoded: $octets\n";
    close( OUTF );

    use HTML::Entities;
    my $ent_input = encode_entities($cgi->param('string'));
    print "Entities input: $ent_input<br>\n";
    my $ent_decode = encode_entities($string);
    print "Entities decoded: $ent_decode<br>\n";
    my $ent_encode = encode_entities($octets);
    print "Entities encoded: $ent_encode<br>\n";
}#if

Lyle

Update: Thanks everyone for the replies :)

In reply to UTF-8: Trying to make sense of form input by cosmicperl
