Fixing broken character encoding

pfaut has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to use Perl to fix broken HTML character encoding?

I am downloading RSS data from a site, and it appears to have been generated by a broken program. It claims to be UTF-8 but I believe it should have been ISO-8859-1. I see things in the text stream that look like &acirc;&#128;&#153; which should translate to an apostrophe. I think something grabbed the bytes, converted them to HTML entities, and then claimed the result was UTF-8. I don't know enough about character encoding, or about the Perl modules that manipulate encodings, to figure out how to convert this back to something that displays correctly in a browser.

I've already complained to the site admins but they haven't fixed the RSS generator yet and I don't suppose they will any time soon.

90% of every Perl application is already written. ⇒

dragonchild


Replies are listed 'Best First'.

Re: Fixing broken character encoding
by moritz (Cardinal) on Jul 26, 2012 at 04:21 UTC

Maybe Encode::Repair can help you?

Perl 6 - the future is here, just unevenly distributed


Re: Fixing broken character encoding
by Anonymous Monk on Jul 26, 2012 at 03:02 UTC

It is possible if the text has been mangled only once (a single step); undoing multiple rounds of mangling can be impossible.
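To make "mangled once" concrete, here is a minimal sketch using only the core Encode module. It assumes the damage was UTF-8 bytes misread as ISO-8859-1; `mangle_once` and `unmangle_once` are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: simulate one round of damage, where correct
# UTF-8 bytes are misread as ISO-8859-1 (one character per byte).
sub mangle_once {
    my ($text) = @_;
    my $utf8_bytes = encode('UTF-8', $text);
    return decode('ISO-8859-1', $utf8_bytes);
}

# Hypothetical helper: undo exactly one such round by turning each
# character back into a byte and decoding those bytes as UTF-8.
sub unmangle_once {
    my ($text) = @_;
    my $bytes = encode('ISO-8859-1', $text);
    return decode('UTF-8', $bytes);
}

my $original = "It\x{2019}s";             # "It's" with a curly apostrophe
my $mangled  = mangle_once($original);    # the apostrophe becomes three characters
my $repaired = unmangle_once($mangled);

print $repaired eq $original ? "repaired\n" : "still broken\n";
```

This reversal only works because ISO-8859-1 maps every byte to a character, so nothing is lost. If some step dropped or substituted bytes (for example by replacing unmappable ones with "?"), the information is gone and no number of reverse passes will bring it back.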

IIRC, http://validator.w3.org/ can help (Bundle::W3C::Validator).

As can these
HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
Encode::Detective - detect a data encoding
Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one

But you ought to post some minimal HTML.

I figure it ought to be as simple as parsing the file, decoding the entities, treating the resulting string as octets, and deciding what charset those octets are in.
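The steps above can be sketched as one small sub. This is only a sketch under assumptions: it assumes the feed contains sequences such as `&acirc;&#128;&#153;` (UTF-8 bytes spelled out as entities), it skips the charset-detection step by simply trying UTF-8, and `fix_entity_mojibake` is a hypothetical name. It relies on `decode_entities` from HTML::Entities and on `decode` with `Encode::FB_CROAK` from Encode.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode         qw(decode);

# Hypothetical helper: decode the entities, then try to read the result
# as UTF-8 octets. If that fails, return the entity-decoded text as is.
sub fix_entity_mojibake {
    my ($html_text) = @_;

    # Non-void context: decode_entities returns a decoded copy.
    my $chars = decode_entities($html_text);

    # Characters above 0xFF cannot be octets, so there is nothing to repair.
    return $chars if $chars =~ /[^\x00-\xFF]/;

    # FB_CROAK dies on malformed UTF-8 and may modify its input,
    # so work on a copy and trap the error with eval.
    my $copy  = $chars;
    my $fixed = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    return defined $fixed ? $fixed : $chars;
}

my $repaired = fix_entity_mojibake('It&acirc;&#128;&#153;s');
binmode STDOUT, ':encoding(UTF-8)';
print "$repaired\n";
```

Checking that the octets are valid UTF-8 before accepting the result keeps genuine ISO-8859-1 text, such as `caf&eacute;`, from being damaged by the repair.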


Re^2: Fixing broken character encoding

by Anonymous Monk on Jul 26, 2012 at 04:04 UTC

This could work

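#!/usr/bin/perl --
use strict; use warnings;
use Data::Dump        qw' dd                              ';
use HTML::Entities    qw' encode_entities decode_entities ';
use Encode            qw' encode decode                   ';
use Encode::Detective qw' detect                          ';

# sample of the broken text: UTF-8 bytes written out as HTML entities
my $odata = my $str = '&acirc;&#128;&#153;';

# entities -> one character per original byte (void context: decodes in place)
decode_entities($str);
dd $str;
dd encode_entities($str);

# guess the charset of those octets
my $encoding = detect($str);
dd $encoding;

# decode the octets as UTF-8 to recover the real character
$str = decode( 'UTF-8', $str );
dd $str;
dd encode_entities($str);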
__END__
"\xE2\x80\x99"
"&acirc;&#128;&#153;"
"UTF-8"
"\x{2019}"
"&rsquo;"

Re^3: Fixing broken character encoding

by pfaut (Priest) on Jul 26, 2012 at 10:25 UTC

That's showing some promise. Thank you.

