PerlMonks  

I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables.

In the specific case of the corpus of documents I need to repair (which, by the way, is a very common case), every damaged character is one that Windows-1252 encodes in the range 0x80 through 0x9F. Each was encoded as UTF-8, and the resulting bytes were then misread as Windows-1252, so it now appears as a sequence of two or three wrong characters. Only 32 such sequences exist, so I can repair the damage with a small lookup table and a regular expression that matches exactly those sequences.
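To make the damage concrete, here is a minimal sketch (not part of the repair script) that reproduces it for a single character with the core Encode module. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# U+2019 RIGHT SINGLE QUOTATION MARK is byte 0x92 in Windows-1252.
my $original = "\x{2019}";

# Correctly encoded as UTF-8, it becomes the three bytes E2 80 99.
my $utf8_bytes = encode('UTF-8', $original);

# Misreading those three bytes as Windows-1252 yields three characters.
my $mojibake = decode('cp1252', $utf8_bytes);

# Show the code points of the damaged sequence.
print join(' ', map { sprintf 'U+%04X', ord $_ } split //, $mojibake), "\n";
# prints "U+00E2 U+20AC U+2122"
```

Repairing the text means mapping that three-character sequence back to U+2019.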

Here's the script I'll use to repair the many text files with the UTF-8/Windows-1252 character encoding damage in them:

#!perl

use strict;
use warnings;

use open qw( :encoding(UTF-8) :std );

use English qw( -no_match_vars );
use File::Glob qw( bsd_glob );

@ARGV or die "Usage: perl $PROGRAM_NAME file ...\n";

local @ARGV = map { bsd_glob($ARG) } @ARGV;
local $INPLACE_EDIT = '.bak';

# See http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

my %mojibake_replace = (
    "\x{00E2}\x{201A}\x{00AC}" => "\x{20AC}", # 0x80 EURO SIGN
    "\x{00C2}\x{0081}"         => "\x{0081}", # 0x81 UNDEFINED
    "\x{00E2}\x{20AC}\x{0161}" => "\x{201A}", # 0x82 SINGLE LOW-9 QUOTATION MARK
    "\x{00C6}\x{2019}"         => "\x{0192}", # 0x83 LATIN SMALL LETTER F WITH HOOK
    "\x{00E2}\x{20AC}\x{017E}" => "\x{201E}", # 0x84 DOUBLE LOW-9 QUOTATION MARK
    "\x{00E2}\x{20AC}\x{00A6}" => "\x{2026}", # 0x85 HORIZONTAL ELLIPSIS
    "\x{00E2}\x{20AC}\x{00A0}" => "\x{2020}", # 0x86 DAGGER
    "\x{00E2}\x{20AC}\x{00A1}" => "\x{2021}", # 0x87 DOUBLE DAGGER
    "\x{00CB}\x{2020}"         => "\x{02C6}", # 0x88 MODIFIER LETTER CIRCUMFLEX ACCENT
    "\x{00E2}\x{20AC}\x{00B0}" => "\x{2030}", # 0x89 PER MILLE SIGN
    "\x{00C5}\x{00A0}"         => "\x{0160}", # 0x8A LATIN CAPITAL LETTER S WITH CARON
    "\x{00E2}\x{20AC}\x{00B9}" => "\x{2039}", # 0x8B SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    "\x{00C5}\x{2019}"         => "\x{0152}", # 0x8C LATIN CAPITAL LIGATURE OE
    "\x{00C2}\x{008D}"         => "\x{008D}", # 0x8D UNDEFINED
    "\x{00C5}\x{00BD}"         => "\x{017D}", # 0x8E LATIN CAPITAL LETTER Z WITH CARON
    "\x{00C2}\x{008F}"         => "\x{008F}", # 0x8F UNDEFINED
    "\x{00C2}\x{0090}"         => "\x{0090}", # 0x90 UNDEFINED
    "\x{00E2}\x{20AC}\x{02DC}" => "\x{2018}", # 0x91 LEFT SINGLE QUOTATION MARK
    "\x{00E2}\x{20AC}\x{2122}" => "\x{2019}", # 0x92 RIGHT SINGLE QUOTATION MARK
    "\x{00E2}\x{20AC}\x{0153}" => "\x{201C}", # 0x93 LEFT DOUBLE QUOTATION MARK
    "\x{00E2}\x{20AC}\x{009D}" => "\x{201D}", # 0x94 RIGHT DOUBLE QUOTATION MARK
    "\x{00E2}\x{20AC}\x{00A2}" => "\x{2022}", # 0x95 BULLET
    "\x{00E2}\x{20AC}\x{201C}" => "\x{2013}", # 0x96 EN DASH
    "\x{00E2}\x{20AC}\x{201D}" => "\x{2014}", # 0x97 EM DASH
    "\x{00CB}\x{0153}"         => "\x{02DC}", # 0x98 SMALL TILDE
    "\x{00E2}\x{201E}\x{00A2}" => "\x{2122}", # 0x99 TRADE MARK SIGN
    "\x{00C5}\x{00A1}"         => "\x{0161}", # 0x9A LATIN SMALL LETTER S WITH CARON
    "\x{00E2}\x{20AC}\x{00BA}" => "\x{203A}", # 0x9B SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    "\x{00C5}\x{201C}"         => "\x{0153}", # 0x9C LATIN SMALL LIGATURE OE
    "\x{00C2}\x{009D}"         => "\x{009D}", # 0x9D UNDEFINED
    "\x{00C5}\x{00BE}"         => "\x{017E}", # 0x9E LATIN SMALL LETTER Z WITH CARON
    "\x{00C5}\x{00B8}"         => "\x{0178}", # 0x9F LATIN CAPITAL LETTER Y WITH DIAERESIS
);

my $mojibake_regex = qr{
    (
        \x{00C2}[\x{0081}\x{008D}\x{008F}\x{0090}\x{009D}]
      | \x{00C5}[\x{00A0}\x{00A1}\x{00B8}\x{00BD}\x{00BE}\x{2019}\x{201C}]
      | \x{00C6}\x{2019}
      | \x{00CB}[\x{0153}\x{2020}]
      | \x{00E2}\x{20AC}[\x{009D}\x{00A0}\x{00A1}\x{00A2}\x{00A6}\x{00B0}\x{00B9}\x{00BA}\x{0153}\x{0161}\x{017E}\x{02DC}\x{201C}\x{201D}\x{2122}]
      | \x{00E2}\x{201A}\x{00AC}
      | \x{00E2}\x{201E}\x{00A2}
    )
}x;

while (<ARGV>) {
    s/$mojibake_regex/$mojibake_replace{$1}/g;
    print;
}

exit 0;

(As always, constructive criticism and earnest suggestions for improvement are welcome and appreciated.)
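One way to cross-check the hand-typed lookup table in the script above is to derive the same mapping with the core Encode module. This is only a sketch, not part of the script: the helper cp1252_char and the names %derived_replace and $derived_regex are made up for illustration. It assumes that the five bytes Windows-1252 leaves undefined (81, 8D, 8F, 90, 9D) should pass through as the control characters with the same code points, which is what the table in the script does.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: the character Windows-1252 assigns to one byte.
# For bytes the encoding leaves undefined, decode() may substitute U+FFFD;
# in that case we assume the code point with the same value as the byte.
sub cp1252_char {
    my ($byte) = @_;
    my $octet  = chr $byte;
    my $char   = decode('cp1252', $octet);
    return $char eq "\x{FFFD}" ? chr $byte : $char;
}

my %derived_replace;

for my $byte (0x80 .. 0x9F) {
    my $original = cp1252_char($byte);

    # Encode as UTF-8, then misread each resulting byte as Windows-1252.
    my $mojibake = join '',
        map { cp1252_char(ord $_) } split //, encode('UTF-8', $original);

    $derived_replace{$mojibake} = $original;
}

# Build one alternation, longest sequences first.
my $alternation = join '|',
    map  { quotemeta $_ }
    sort { length $b <=> length $a || $a cmp $b }
    keys %derived_replace;
my $derived_regex = qr/($alternation)/;

printf "%d sequences\n", scalar keys %derived_replace;
```

Comparing the keys and values of %derived_replace with %mojibake_replace would catch a typo in any of the 32 hand-written entries.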

Don’t know if this particular case is already covered somewhere.

I'm a little surprised by this blind spot in Encode::Repair because, in my experience, this is by far the most ubiquitous kind of mojibake in Latin script text (i.e., text in Western European languages). In fairness to its author, moritz, the documentation includes the following Development section:

  Development
      The source code is stored in a public git repository at
      <http://github.com/moritz/Encode-Repair>. If you find any bugs, please
      used the issue tracker linked from this site.

      If you find a case of messed-up encodings that can be repaired
      deterministically and that's not covered by this module, please contact
      the author, providing a hex dump of both input and output, and as much
      information of the encoding and decoding process as you have.

      Patches are also very welcome.

In reply to Re^4: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim
in thread Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim
