Re^3: UTF-8: Trying to make sense of form input

in reply to Re^2: UTF-8: Trying to make sense of form input
in thread UTF-8: Trying to make sense of form input

Yes, it can be useful in debugging when the flag matters. In this case, it only served as a distraction. Thinking in terms of the UTF8 flag is the wrong way to go. Thinking in terms of whether each string is encoded or decoded would have avoided all his problems.

param returns encoded chars.
decode_entities accepts decoded chars.
decode_entities returns decoded chars.
print without :encoding accepts encoded chars.

Therefore, he needs to decode what param returns and encode what he prints.
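
A minimal sketch of that decode-on-input, encode-on-output pattern, using only the core Encode module. The `$raw_param` value is an assumption standing in for what param returns (a UTF-8 encoded byte string), and the entity-decoding step is only marked by a comment so the snippet has no non-core dependencies.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for param('name'): the UTF-8 bytes of "caf\x{e9} \x{263a}" (assumed input).
my $raw_param = "caf\xC3\xA9 \xE2\x98\xBA";

# Decode once, at the input boundary: bytes -> characters.
my $text = decode('UTF-8', $raw_param);

# Work on decoded characters here; entity decoding would happen at this point.
my $char_count = length $text;    # 6 characters, where the raw input had 9 bytes

# Encode once, at the output boundary: characters -> bytes.
my $out = encode('UTF-8', $text);
print $out, "\n";
```

Keeping the decode and encode calls at the edges of the program means everything in between can assume decoded characters.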

Using is_utf8 gives a hint as to whether the characters are decoded or not, but it's not reliable. In fact, it's specifically unreliable with decode_entities, since the string decode_entities returns can have the UTF8 flag in either state. Documentation and Hungarian Notation are better tools here than is_utf8.
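
To illustrate why the flag is a poor proxy for "decoded", here is a small sketch in which two strings hold the same single character and compare equal, yet report different flag states. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $native   = "\xE9";       # one character, U+00E9, stored without the UTF8 flag
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, internal storage now flagged UTF8

my $same_string  = ($native eq $upgraded)   ? 1 : 0;
my $flag_native  = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgrade = utf8::is_utf8($upgraded) ? 1 : 0;

# prints: equal=1 native_flag=0 upgraded_flag=1
print "equal=$same_string native_flag=$flag_native upgraded_flag=$flag_upgrade\n";
```

The flag describes how Perl stores the string internally, not whether the program has decoded it.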

Update: Fixed ambiguous pronouns. Fixed bad grammar. Fixed formatting.


Replies are listed 'Best First'.

Re^4: UTF-8: Trying to make sense of form input
by creamygoodness (Curate) on Aug 16, 2009 at 05:41 UTC

I think you're right that the OP needs to grasp the mental model you've laid out.

But I predict that until the OP masters debugging the encoding -- which requires understanding the role of the UTF8 flag -- problems are going to keep cropping up. If there were an "encoded/decoded" flag that you could check, that would be lovely. Since no such flag exists, you need to be able to look at the raw string and the presence/absence of the UTF8 flag in Devel::Peek to see what's going wrong.
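
A hedged sketch of that kind of inspection, using the core Devel::Peek and Encode modules. The two variables are illustrative; the exact Dump output varies between Perl versions, so the comments describe it only roughly.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

my $bytes = "\xE2\x98\xBA";             # UTF-8 encoding of U+263A: three bytes
my $chars = decode('UTF-8', $bytes);    # the same data decoded: one character

# Dump writes to STDERR.
Dump($bytes);    # PV shows the three raw bytes; FLAGS should not list UTF8
Dump($chars);    # FLAGS should list UTF8, and the PV line also shows the decoded form

printf "bytes: length %d, chars: length %d\n", length($bytes), length($chars);
```

Comparing the raw PV contents with what the string is supposed to contain usually shows whether it was encoded twice or never decoded.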

There are simply too many opportunities to mess up. Forget a binmode() here, omit (or include) a -utf8 argument there, forget to set pg_enable_utf8 on your DBD::Pg db handle, pass something through YAML::Syck without setting $YAML::Syck::ImplicitUnicode, and so on.

In short... documentation and Hungarian notation are too unreliable :) -- because the underlying system is too hard to control from a high level.

IMO, the only way to achieve high reliability for UTF-8 is to write tests.

use Test::More tests => 1;

my $smiley = "\x{263a}";
my $maybe = round_trip($smiley);
is( $maybe, $smiley, 
    "String survives round trip including UTF8 flag" );
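
For a version that runs on its own, here is a sketch with a hypothetical `round_trip()` that writes the string to a temporary file and reads it back through explicit `:encoding(UTF-8)` layers. The file-based round trip is an assumption for illustration; in practice it would be whatever path the data really takes (database, YAML, CGI, and so on). Since `is` compares string contents with `eq`, the flag state gets its own assertion.

```perl
use strict;
use warnings;
use Test::More tests => 2;
use File::Temp qw(tempfile);

# Hypothetical round_trip(): write to a temp file, then read it back,
# with an explicit :encoding(UTF-8) layer on both handles.
sub round_trip {
    my ($string) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, ':encoding(UTF-8)') or die "binmode: $!";
    print {$fh} $string;
    close($fh) or die "close: $!";

    open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
    my $back = do { local $/; <$in> };
    close($in);
    return $back;
}

my $smiley = "\x{263a}";
my $maybe  = round_trip($smiley);

is( $maybe, $smiley, 'string survives the round trip' );
is( utf8::is_utf8($maybe) ? 1 : 0, utf8::is_utf8($smiley) ? 1 : 0,
    'UTF8 flag state matches' );
```

Dropping either encoding layer from `round_trip()` is a quick way to see the kind of failure such a test is meant to catch.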

PS: You updated your node multiple times over the half hour or so after it was posted, forcing me to keep rewriting my reply. :(

