Re: UTF-8: Trying to make sense of form input

Don't look at is_utf8. That's going down the wrong path.

Given this, I decided to use HTML::Entities to convert characters such as &pound; to £. This is where things got more confusing.

If param foo is encoded using UTF-8 and consists of text with HTML entities, you want

my $text = decode_entities(decode('UTF-8', $cgi->param('foo')));

Don't forget to encode the result if you output it, in part or in full (using encode, or binmode with :encoding on the output handle).
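To make the decode, decode_entities, encode sequence concrete, here is a minimal sketch. It assumes a hard-coded byte string standing in for what $cgi->param('foo') would return (the sample input and the variable names are made up for illustration), so it runs without a CGI environment.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical stand-in for $cgi->param('foo'): the UTF-8 bytes a browser
# might submit for the text "cafe" (with e-acute) followed by " &pound;5".
my $raw_bytes = "caf\xC3\xA9 &pound;5";

# Step 1: bytes -> characters. Step 2: entities -> characters.
my $text = decode_entities(decode('UTF-8', $raw_bytes));

# Step 3: characters -> bytes, because STDOUT has no :encoding layer here.
my $out_bytes = encode('UTF-8', $text);
print $out_bytes, "\n";
```

Calling binmode(STDOUT, ':encoding(UTF-8)') once and printing $text directly should be equivalent to the explicit encode in step 3.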


Re^2: UTF-8: Trying to make sense of form input by creamygoodness (Curate) on Aug 16, 2009 at 03:58 UTC
In my opinion, it's futile to troubleshoot UTF-8 issues without understanding the underlying implementation and keeping track of the `SVf_UTF8` flag, using `Encode::is_utf8()` when convenient and Devel::Peek's `Dump()` when necessary. The interface for `Encode::is_utf8()` is dreadful, but it's better than flailing in the dark.
Re^3: UTF-8: Trying to make sense of form input by ikegami (Patriarch) on Aug 16, 2009 at 04:25 UTC
Yes, it can be useful in debugging when the flag matters. In this case, it only served as a distraction. Thinking in terms of the UTF8 flag is the wrong way to go. Thinking in terms of encoded or not would have avoided all his problems.

`param` returns encoded chars. `decode_entities` accepts decoded chars. `decode_entities` returns decoded chars. `print` without `:encoding` accepts encoded chars. Therefore, he needs to decode what `param` returns and encode what he prints.

Using `is_utf8` gives an idea whether the characters are decoded or not, but it's not reliable. In fact, it's specifically unreliable with `decode_entities`, since the string `decode_entities` returns can have either state for the `UTF8` flag. Documentation and Hungarian Notation are better tools here than `is_utf8`.

Update: Fixed ambiguous pronouns. Fixed bad grammar. Fixed formatting.
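A small sketch of why the flag alone says little about whether a string is decoded: the same decoded text can be stored with the flag off or on, and both copies compare equal and encode to the same bytes. The variable names are illustrative only; the `utf8::` functions used here are built into perl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same decoded text (one character, e-acute) in both internal forms.
my $flag_off = "\x{e9}";
my $flag_on  = "\x{e9}";
utf8::downgrade($flag_off);   # stored one byte per character, UTF8 flag off
utf8::upgrade($flag_on);      # stored as UTF-8 internally, UTF8 flag on

# Record the flag states before doing anything else with the strings.
my $off_state = utf8::is_utf8($flag_off) ? 1 : 0;
my $on_state  = utf8::is_utf8($flag_on)  ? 1 : 0;

# Perl compares characters, not internal bytes, so the copies are equal,
# and encoding either one yields the same UTF-8 output.
my $same_text   = $flag_off eq $flag_on ? 1 : 0;
my $same_output = encode('UTF-8', $flag_off) eq encode('UTF-8', $flag_on) ? 1 : 0;

printf "flag states: %d and %d\n", $off_state, $on_state;
printf "same text: %d, same encoded output: %d\n", $same_text, $same_output;
```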
Re^4: UTF-8: Trying to make sense of form input by creamygoodness (Curate) on Aug 16, 2009 at 05:41 UTC
I think you're right that the OP needs to grasp the mental model you've laid out. But I predict that until the OP masters debugging the encoding -- which requires understanding the role of the `UTF8` flag -- problems are going to keep cropping up.

If there were an "encoded/decoded" flag that you could check, that would be lovely. Since no such flag exists, you need to be able to look at the raw string and the presence/absence of the `UTF8` flag in Devel::Peek to see what's going wrong. There are simply too many opportunities to mess up. Forget a `binmode()` here, omit (or include) a `-utf8` argument there, forget to set `pg_enable_utf8` on your DBD::Pg db handle, pass something through YAML::Syck without setting `$YAML::Syck::ImplicitUnicode`, and so on.

In short... documentation and Hungarian notation are too unreliable :) -- because the underlying system is too hard to control from a high level. IMO, the only way to achieve high reliability for UTF-8 is to write tests.

`use Test::More tests => 1; my $smiley = "\x{263a}"; my $maybe = round_trip($smiley); is( $maybe, $smiley, "String survives round trip including UTF8 flag" );`

PS: You updated your node multiple times over the half hour or so after it was posted, forcing me to keep rewriting my reply. :(
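The test above leaves `round_trip` undefined. A runnable sketch, assuming a hypothetical `round_trip` that simply encodes to UTF-8 and decodes back, might look like the following; in a real application that sub would write to and read from the database, YAML file, or whatever layer is under suspicion. Note that `is` compares string contents only and does not inspect the `UTF8` flag.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Test::More tests => 2;

# Hypothetical stand-in for the real storage or transport layer:
# characters -> UTF-8 bytes -> characters.
sub round_trip {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('UTF-8', $bytes);
}

my $smiley = "\x{263a}";
my $maybe  = round_trip($smiley);

is($maybe, $smiley, 'string survives the round trip');
is(length($maybe), 1, 'still one character, not three bytes');
```

A double-encoding or missing-decode bug in the layer under test would typically show up as a length of 3 or more in the second check.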


	PerlMonks