In my opinion, it's futile to troubleshoot UTF-8 issues without understanding the underlying implementation and keeping track of the SVf_UTF8 flag, using Encode::is_utf8() when convenient and Devel::Peek's Dump() when necessary.

The interface for Encode::is_utf8() is dreadful, but it's better than flailing in the dark.

Re^3: UTF-8: Trying to make sense of form input
by ikegami (Patriarch) on Aug 16, 2009 at 04:25 UTC

    Yes, it can be useful in debugging when the flag matters. In this case, it only served as a distraction. Thinking in terms of the UTF8 flag is the wrong way to go. Thinking in terms of encoded versus decoded would have avoided all his problems.

    • param returns encoded chars.
    • decode_entities accepts decoded chars.
    • decode_entities returns decoded chars.
    • print without :encoding accepts encoded chars.

    Therefore, he needs to decode what param returns and encode what he prints.
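
    A minimal sketch of that decode-in, encode-out flow. The literal byte string is a hypothetical stand-in for whatever param returns, and it assumes the form was submitted as UTF-8:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);
    use HTML::Entities qw(decode_entities);

    # Hypothetical stand-in for the value param() returns: UTF-8 encoded bytes.
    # "\xC3\xA9" is e-acute and "\xC3\xA8" is e-grave, both as UTF-8.
    my $raw_param = "caf\xC3\xA9 &amp; cr\xC3\xA8me";

    # Decode at the input boundary: bytes in, characters out.
    my $text = decode('UTF-8', $raw_param);

    # decode_entities is given decoded characters and returns decoded characters.
    my $plain = decode_entities($text);

    # Encode at the output boundary, since STDOUT has no :encoding layer here.
    my $out_bytes = encode('UTF-8', $plain);
    print $out_bytes, "\n";
    ```

    The point is that each variable is known to hold either bytes or characters because of where it came from, not because of anything inspected at run time.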

    Using is_utf8 gives a hint as to whether the characters are decoded, but it's not reliable. In fact, it's specifically unreliable with decode_entities, since the string decode_entities returns can have the UTF8 flag either on or off. Documentation and Hungarian Notation are better tools here than is_utf8.
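
    To illustrate why the flag is a poor proxy for "decoded", here is a small sketch (variable names are mine) in which the flag differs between equal strings, and is on for a string that still holds encoded bytes:

    ```perl
    use strict;
    use warnings;
    use Encode ();

    # The same decoded text twice. U+00E9 fits in one byte, so perl can
    # store it with or without the UTF8 flag.
    my $flag_off = "\xE9";
    my $flag_on  = "\xE9";
    utf8::upgrade($flag_on);    # changes internal storage, not the characters

    print "same characters\n" if $flag_off eq $flag_on;
    print "flag_off: ", (Encode::is_utf8($flag_off) ? "on" : "off"), "\n";
    print "flag_on:  ", (Encode::is_utf8($flag_on)  ? "on" : "off"), "\n";

    # Still-encoded bytes (the UTF-8 encoding of e-acute) with the flag on.
    my $encoded = "\xC3\xA9";
    utf8::upgrade($encoded);
    print "encoded:  ", (Encode::is_utf8($encoded) ? "on" : "off"),
          ", length ", length($encoded), "\n";
    ```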

    Update: Fixed ambiguous pronouns. Fixed bad grammar. Fixed formatting.

      I think you're right that the OP needs to grasp the mental model you've laid out.

      But I predict that until the OP masters debugging the encoding -- which requires understanding the role of the UTF8 flag -- problems are going to keep cropping up. If there were an "encoded/decoded" flag that you could check, that would be lovely. Since no such flag exists, you need to be able to look at the raw string and the presence/absence of the UTF8 flag in Devel::Peek to see what's going wrong.
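
      For anyone who hasn't used it, this is roughly what I mean by looking at the raw string. It's only a sketch with made-up variable names; the exact Dump output varies between perl versions, so I'm not pasting any:

      ```perl
      use strict;
      use warnings;
      use Devel::Peek qw(Dump);
      use Encode qw(decode);

      my $bytes = "\xE2\x98\xBA";            # UTF-8 encoding of U+263A, three bytes
      my $chars = decode('UTF-8', $bytes);   # the same smiley as one character

      # Dump writes to STDERR. Compare the FLAGS line (is UTF8 listed?) and
      # the PV line (the raw buffer) for the two scalars.
      Dump($bytes);
      Dump($chars);
      ```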

      There are simply too many opportunities to mess up. Forget a binmode() here, omit (or include) a -utf8 argument there, forget to set pg_enable_utf8 on your DBD::Pg db handle, pass something through YAML::Syck without setting $YAML::Syck::ImplicitUnicode, and so on.
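
      As one example of the binmode() kind of mistake, here is a sketch that prints to an in-memory handle (my choice, so the result can be inspected) opened with an :encoding layer. Without the layer, the same print would draw a "Wide character" warning:

      ```perl
      use strict;
      use warnings;

      my $smiley = "\x{263a}";

      # With an :encoding layer, the handle does the encoding on output.
      my $buffer = '';
      open my $fh, '>:encoding(UTF-8)', \$buffer or die "open: $!";
      print {$fh} $smiley;
      close $fh;

      # Show the bytes that actually reached the "file", in hex.
      printf "%vX\n", $buffer;
      ```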

      In short... documentation and Hungarian notation are too unreliable :) -- because the underlying system is too hard to control from a high level.

      IMO, the only way to achieve high reliability for UTF-8 is to write tests.

      use Test::More tests => 1;

      my $smiley = "\x{263a}";
      my $maybe  = round_trip($smiley);
      is( $maybe, $smiley, "String survives round trip including UTF8 flag" );
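
      A self-contained version might look like the sketch below. The round_trip here is a hypothetical stand-in that only encodes and decodes; in real code it would be the form, database, or YAML path being tested. Note that is() compares with eq, which doesn't look at the flag, so I've added a length check to catch a result that comes back as bytes:

      ```perl
      use strict;
      use warnings;
      use Test::More tests => 2;
      use Encode qw(encode decode);

      # Hypothetical stand-in for the system under test: encode to UTF-8
      # bytes, then decode back to characters.
      sub round_trip {
          my ($chars) = @_;
          my $bytes = encode('UTF-8', $chars);
          return decode('UTF-8', $bytes);
      }

      my $smiley = "\x{263a}";
      my $maybe  = round_trip($smiley);

      is( $maybe, $smiley, 'String survives round trip' );
      is( length($maybe), 1, 'Result is one character, not three bytes' );
      ```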

      PS: You updated your node multiple times over the half hour or so after it was posted, forcing me to keep rewriting my reply. :(