PerlMonks  

Hi All,
  After hitting issues with form input that contained non-ASCII characters, such as £, I wrote a QnD script to try and understand what is going on. I'm afraid I still don't fully understand :/
  The code for the script is included at the bottom; it'll run on Linux or Windows, under Apache, IIS or others.
  As far as I understand it:-
  • The form input is being encoded as UTF-8 by the browser, as the server has set a UTF-8 charset in its headers
  • When Perl CGI.pm picks it up, it has no idea it's UTF-8
  • If it gets saved straight out to a file, the bytes will still be UTF-8, although the file itself may not be treated as UTF-8
  • If decoded with Encode.pm, Perl will flag it as being UTF-8, but convert it to its own internal format
  • If encoded with Encode.pm Perl will NOT flag it as being UTF-8, it'll actually be double encoded
  • If you try to manipulate a UTF-8 string that hasn't been decoded, such as with a regexp, strange things might happen
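To check the points above outside of CGI, here is a minimal sketch. It assumes the browser sent £ as the two UTF-8 bytes 0xC2 0xA3; the variable names are only for illustration and are not part of the test script at the bottom.

```perl
use strict;
use warnings;
use Encode qw( decode encode is_utf8 );

# Assumption: this is what arrives from the form for "£" (raw UTF-8 bytes).
my $input = "\xC2\xA3";

# Undecoded: Perl sees two separate bytes, not one character.
my $input_len = length $input;

# decode() turns the bytes into a single character, U+00A3.
my $chars     = decode( 'UTF-8', $input );
my $chars_len = length $chars;

# encode() on the *undecoded* input treats each byte as a Latin-1
# character and encodes it again: this is the double encoding.
my $double     = encode( 'UTF-8', $input );
my $double_hex = join ' ', map { sprintf '%02X', ord($_) } split //, $double;

# encode() on the decoded string gives back the original bytes.
my $round_trip = encode( 'UTF-8', $chars );

# A regexp that expects exactly one character only matches after decoding.
my $raw_matches     = $input =~ /^.$/ ? 1 : 0;
my $decoded_matches = $chars =~ /^.$/ ? 1 : 0;

printf "input:   length %d, flagged %s\n", $input_len, is_utf8($input) ? 'yes' : 'no';
printf "decoded: length %d, flagged %s\n", $chars_len, is_utf8($chars) ? 'yes' : 'no';
printf "double-encoded bytes: %s\n", $double_hex;
printf "one-char regexp: raw=%d decoded=%d\n", $raw_matches, $decoded_matches;
```

The double-encoded bytes come out as C3 82 C2 A3, which is what shows up as the extra Â when viewed as UTF-8.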
Given this, I decided to use HTML::Entities to convert characters such as £ to &pound;. This is where things got more confusing. The output of my test script is:-
Input: £ (IS UTF8? No)
Decoded: ? (IS UTF8? Yes)
Encoded: Â£ (IS UTF8? No)
Entities input: Â£
Entities decoded: £
Entities encoded: £
If I print the input straight back out it comes out as a normal £, as expected. If decoded, it shows an unrecognised-character symbol; if encoded, the tell-tale Â appears. But if I pass it through HTML::Entities, the input gets the Â and the decoded one comes out right?? The encoded one, well, that comes out even weirder.
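The HTML::Entities part can be reproduced on its own. This is only a sketch, again assuming £ arrives as the bytes 0xC2 0xA3: encode_entities() works character by character, so on the undecoded input it sees two Latin-1 characters, and on the decoded string it sees one.

```perl
use strict;
use warnings;
use Encode qw( decode );
use HTML::Entities qw( encode_entities );

# Assumption: raw form input for "£", i.e. two UTF-8 bytes.
my $input = "\xC2\xA3";

# Undecoded: each byte is treated as its own character (Â, then £).
my $ent_input = encode_entities($input);

# Decoded first: a single character U+00A3, so a single entity.
my $ent_decoded = encode_entities( decode( 'UTF-8', $input ) );

print "Entities from raw input:     $ent_input\n";
print "Entities from decoded input: $ent_decoded\n";
```

The entities are plain ASCII, so they display the same whatever charset the page declares. That would explain why the decoded one looks right in the browser even though printing the decoded string directly does not.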
  On top of this, if you write these out to a file, and view using nano or vi you see:-
Input: £
Decoded: £
Encoded: £
Which didn't make sense to me; I expected the decoded one to be just £. But when I tested this script on Win32 IIS, I got:-
Input: £
Decoded: £
Encoded: £
Which is what I expected???
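To take the editor and terminal out of the picture, the bytes that actually land in the file can be dumped as hex. A rough sketch follows, assuming the same 0xC2 0xA3 input. bytes_written() is a hypothetical helper written for this example, and the :encoding(UTF-8) layer is something the test script below does not use.

```perl
use strict;
use warnings;
use Encode qw( decode );
use File::Temp qw( tempfile );

# Assumption: raw form input for "£", i.e. two UTF-8 bytes.
my $input = "\xC2\xA3";
my $chars = decode( 'UTF-8', $input );

# Hypothetical helper: write $text through $layer, then return the
# bytes found in the file as a hex string.
sub bytes_written {
    my ( $layer, $text ) = @_;
    my ( $fh, $path ) = tempfile( UNLINK => 1 );
    close $fh;

    open my $out, ">$layer", $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out;

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp the whole file
    my $bytes = <$in>;
    close $in;

    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $raw_input   = bytes_written( ':raw',             $input );
my $raw_chars   = bytes_written( ':raw',             $chars );
my $layer_chars = bytes_written( ':encoding(UTF-8)', $chars );

print "raw input, no layer:      $raw_input\n";
print "decoded, no layer:        $raw_chars\n";
print "decoded, UTF-8 layer:     $layer_chars\n";
```

Without an encoding layer the decoded £ goes out as the single byte A3, which is Latin-1 rather than UTF-8. How that byte is then displayed depends on what the viewer assumes, which may be part of why the Linux and Win32 results looked different.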

Maybe a UTF-8 expert could explain this? It might make a good reference.

Test script:-
#!/usr/bin/perl
use strict;

BEGIN {
    # Send the header first so any later errors show up in the browser
    print "content-type: text/html; charset=UTF-8\n\n";
    use FindBin qw ($RealBin $RealScript);
    use lib $FindBin::RealBin;
    chdir $RealBin;
}#BEGIN

use CGI;
my $cgi = new CGI;

# Form that posts back to this script, pre-filled with the last input
print qq~
<form method=POST>
input: <input type=text name=string value="${ \$cgi->param('string') }">
<input type=submit>
</form>
~;

if ( $cgi->param('string') ) {
    use Encode qw( is_utf8 encode decode );

    # 1. The input exactly as CGI.pm hands it over
    print "Input: ${ \$cgi->param('string') } (IS UTF8? ";
    if ( is_utf8($cgi->param('string')) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    # 2. The input after decode()
    my $string = decode("utf8", $cgi->param('string'));
    print "Decoded: $string (IS UTF8? ";
    if ( is_utf8($string) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    # 3. The input after encode() (note: the undecoded input is encoded)
    my $octets = encode("utf8", $cgi->param('string'));
    print "Encoded: $octets (IS UTF8? ";
    if ( is_utf8($octets) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    # Write all three versions to a file, with no encoding layer
    open( OUTF, '>utf8.txt' ) || print("Error writing file");
    print OUTF "Input: ${ \$cgi->param('string') }\n";
    print OUTF "Decoded: $string\n";
    print OUTF "Encoded: $octets\n";
    close( OUTF );

    # Pass all three versions through HTML::Entities
    use HTML::Entities;
    my $ent_input = encode_entities($cgi->param('string'));
    print "Entities input: $ent_input<br>\n";
    my $ent_decode = encode_entities($string);
    print "Entities decoded: $ent_decode<br>\n";
    my $ent_encode = encode_entities($octets);
    print "Entities encoded: $ent_encode<br>\n";
}#if

Lyle

Update: Thanks everyone for the replies :)

In reply to UTF-8: Trying to make sense of form input by cosmicperl
