Re^2: Lost in encodings

in reply to Re: Lost in encodings
in thread Lost in encodings

And thanks once more for your long and helpful reply. I have now tested a bit more, following your tip to use decoded_content.

So when I now look at what LWP reads, I really do get the correct umlaut, which I can also see when I set binmode on the debugger's IO.

The problem seems to lie in the output of MIME::Lite::TT::HTML. Looking at the code, one can provide an input and an output charset. If you don't, MIME::Lite::TT::HTML assumes that you have already provided text in the correct charset :( So what I would need to do is provide the charset of Perl's internal strings, which I assume doesn't exist. I think I'll have to patch MIME::Lite::TT::HTML…

As you wrote:

Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then shout up, I'll write some tests.

So here is my shout out. ;)

I assume the relevant part that needs to be patched is this: https://metacpan.org/release/MIME-Lite-TT-HTML/source/lib/MIME/Lite/TT/HTML.pm lines 115-117:

    $charset = [ $charset ] unless ref $charset eq 'ARRAY';
    my $charset_input  = shift @$charset || 'US-ASCII';
    my $charset_output = shift @$charset || $charset_input;

Here I would provide "something" for the internal perl encoding. Maybe '*internal*'?

Starting at line 156, the code looks dubious. "remove_utf8_flag" does not seem correct after what I learned from you and others in my threads.

And then the from_to encoding should, I guess, be changed to:

if ($charset_input ne $charset_output) {
    my $perl_string= $charset_input eq '*internal*'
        ? $string
        : Encode::decode($charset_input, $string);
    $string= Encode::encode($charset_output, $perl_string);
}
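
To try the idea outside the module, here is a minimal, self-contained sketch. The sub name convert_charset and the sample strings are made up for illustration and are not part of MIME::Lite::TT::HTML; only the '*internal*' marker and the decode/encode logic come from the snippet above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not in MIME::Lite::TT::HTML): converts $string from
# $charset_input to $charset_output. The made-up marker '*internal*' means
# "already a decoded Perl string, do not decode again".
sub convert_charset {
    my ($string, $charset_input, $charset_output) = @_;
    return $string if $charset_input eq $charset_output;
    my $perl_string = $charset_input eq '*internal*'
        ? $string
        : Encode::decode($charset_input, $string);
    return Encode::encode($charset_output, $perl_string);
}

# The same word, once as decoded Perl characters, once as ISO-8859-1 bytes.
my $from_internal = convert_charset("Gr\x{FC}\x{DF}e", '*internal*', 'UTF-8');
my $from_latin1   = convert_charset("Gr\xFC\xDFe",     'ISO-8859-1', 'UTF-8');
```

Both calls should end up with the same UTF-8 byte string, which is what the output side needs.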

What do you think?

Update: I've created a patch that lets one tell MIME::Lite::TT::HTML that the text provided ($charset_input) is in Perl's internal representation. With this in place, my script works as expected.

Unfortunately it seems the module is abandoned as the issues opened for it are 12 years old :(

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e


Replies are listed 'Best First'.

Re^3: Lost in encodings
by haj (Vicar) on Feb 10, 2020 at 15:41 UTC

That's a good job in tracking that down to the root cause!

When I wrote my previous response, I failed to check the version history of MIME::Lite::TT::HTML. Otherwise I would not have made the assumption that the module does the right thing. It does not, as you found out. The current release is from 2007 (Perl 5.10-ish), so not only was Unicode support rather new and sometimes bumpy in Perl, but module authors also didn't have much experience with it, nor did all CPAN modules support it.

After having looked into the module's source code: The module works with all input in byte-encoded form. Today this is considered bad practice since it breaks a lot of Perl's string processing features, including those available from Template Toolkit. The module also assumes that the subject is encoded in the same encoding as the template files, which is even more questionable. So yes, patching (or subclassing) the module's methods encode_subject and encode_body would be the way to go. Filing an issue for the module would also be fine, but according to the current list of open issues it doesn't look like the author is still actively maintaining the module.

There is no keyword for Perl's internal encoding (because, by definition, these strings are decoded). So you could either invent one like *internal* or even use an undefined value as an indicator that your input should not be decoded. Your fix should do the trick if you want to go that path.

remove_utf8_flag is indeed scary and another example of an attempt to achieve cancellation of errors. I am pretty sure that TT processing could result in this flag being set, even if the TT results are pure ASCII. Instead of re-evaluating his assumptions, the author just killed the flag to make the string fit his expectations. With current Perl you wouldn't get rid of the flag like that, and Encode::decode will happily decode strings which already have the flag set.
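
A small sketch of why the flag by itself says nothing about the content, using only core functions (utf8::upgrade, utf8::is_utf8, Encode::decode). The variable names are made up for illustration, and this does not reproduce what remove_utf8_flag does in the module.

```perl
use strict;
use warnings;
use Encode ();

# A pure ASCII string can carry the UTF8 flag without changing its value.
my $ascii   = "plain ASCII";
my $flagged = $ascii;
utf8::upgrade($flagged);                 # changes only the internal storage
my $flag_on    = utf8::is_utf8($flagged) ? 1 : 0;
my $same_value = $flagged eq $ascii ? 1 : 0;

# UTF-8 encoded bytes for 'ü' that happen to be stored with the flag set
# (e.g. after passing through code that upgraded the string).
my $bytes = "\xC3\xBC";
utf8::upgrade($bytes);
my $decoded = Encode::decode('UTF-8', $bytes);   # still decodes as usual
```

So the flag is an implementation detail of how Perl stores the string; what matters is whether the string holds characters or encoded bytes.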

Another alternative, with more coding but better alignment with current practice, would be to get rid of $charset_input and expect that the subject and the template parameters are Perl strings. You'd still need TT's ENCODING config because UTF-8 text in files needs decoding, and $charset_output is also still required because MIME::Lite explicitly says that it expects encoded strings.
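
A rough sketch of that alternative, under stated assumptions: the sub names mirror the module's encode_subject and encode_body, but the signatures and bodies here are invented for illustration and use only core modules (Encode with its MIME-Header encoding, MIME::QuotedPrint). Template processing and TT's ENCODING option are left out.

```perl
use strict;
use warnings;
use Encode ();
use MIME::QuotedPrint ();

# Assumed replacement: takes a decoded Perl string, returns an RFC 2047
# encoded header value. Encode's 'MIME-Header' encoding always uses UTF-8.
sub encode_subject {
    my ($subject) = @_;
    return Encode::encode('MIME-Header', $subject);
}

# Assumed replacement: takes a decoded Perl string and the output charset,
# returns quoted-printable bytes ready to hand to MIME::Lite.
sub encode_body {
    my ($body, $charset_output) = @_;
    my $octets = Encode::encode($charset_output, $body);
    return MIME::QuotedPrint::encode_qp($octets);
}

my $subject     = encode_subject("Gr\x{FC}\x{DF}e");
my $body_utf8   = encode_body("f\x{FC}r dich\n", 'UTF-8');
my $body_latin1 = encode_body("f\x{FC}r dich\n", 'ISO-8859-1');
```

With 'UTF-8' as the output charset the 'ü' in the body should come out as =C3=BC, while 'ISO-8859-1' gives the =FC seen in the original problem, so the Content-Type charset has to match whatever $charset_output was used.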
