http://qs321.pair.com?node_id=11112585


in reply to Lost in encodings

I was sort of expecting this to come as a follow-up to your previous article but didn't want to overcomplicate things :)

One of the issues with encoding is that it happens in so many places that quite often things look right while you actually have a cancellation of errors. Your example is no exception. So here are some points:

So, what's happening here? You read data with LWP. Though you haven't given the details, I guess you are using the content method to retrieve the data. This method always gives bytes. But wait: LWP can use the charset attribute from the Content-Type header to decode text into characters, and indeed it will do so if you use the method decoded_content method instead.

The data is displayed correctly because you are feeding non-decoded bytes to a terminal which expects UTF-8-encoded bytes. Since your input was also UTF-8-encoded bytes, it looks fine. Your application is just a man in the middle which passes these data through.

Decoding the data is the correct way (which, as I wrote, LWP can do for you if you want). Perl then knows that the character in question is a 'ü'. Perl can handle this character in its default encoding, which is slightly infortunate, because it will do so and print one Byte for that character. This character hits a terminal which expects UTF-8 encoded bytes, doesn't understand the character and substitutes it with the Unicode replacement character.

Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset     => 'utf8', then shout up, I'll write some tests.

As for handling the debugger: Since you are working with an UTF-8 terminal, you might want to try the following:

binmode DB::OUT,':utf8'; binmode DB::IN,':utf8'

This makes the debugger handle its I/O as UTF-8 encoded.

....and, because I just read the reply by LanX, I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.