Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Mass regsub on High-bit chars.

by abaxaba (Hermit)
on May 26, 2005 at 16:47 UTC ( [id://460755]=perlquestion: print w/replies, xml ) Need Help??

abaxaba has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a mass data import that had a lot of cut/paste out of different MS office products (even if they were run on a Mac!). I'm importing over 19000 records.

As you might imagine, I'm running against characters like \x92-\x94, \x80, etc. So, some of these are easy enough:

s|\x{92}|'|g; s|[\x{93}\x{94}]|"g|;
Now, I could single each one out, and figure out what they mean. But - highly impracticle for 19K records. I'd like to convert them to their ascii representations before I shove them into the database. Any quick and dirty out there that accomplishes this?

ÅßÅ×ÅßÅ
"It is a very mixed blessing to be brought back from the dead." -- Kurt Vonnegut

Replies are listed 'Best First'.
Re: Mass regsub on High-bit chars.
by thundergnat (Deacon) on May 26, 2005 at 17:32 UTC

    Unfortunately, no, there is no real premade way to translate them into ASCII because those characters do not exist in ASCII. What to do depends on requrements of your database and what scheme you will use to convert the various characters.

    If your db will support Unicode, the easiest thing to do would be to convert the encoding from cp1252 to utf-8.

    use Encoding; my $record = 'whatever'; from_to($record, 'cp1252', 'utf-8');

    If it will handle Latin-1, then all you need to do is handle the characters from \x80-\x9F than re-encode to 'iso-8859-1'. Other than \x80-\x9F, MS cp1252 and Latin-1 are identical.

    If you really need ASCII, you are going to have to come up with your own transliteration scheme for which ASCII character is an acceptable replacement for the "upper" characters.

    There is a list of cp1252 characters with their Unicode codepoints available at http://www.microsoft.com/typography/unicode/1252.htm

Re: Mass regsub on High-bit chars.
by polettix (Vicar) on May 26, 2005 at 16:55 UTC
    #!/usr/bin/perl use warnings; use strict; my $c = '\x{92}\x{93}\x{94}'; $c =~ s/\\x\{(\d{2})\}/{chr(hex($1))}/ge; print $c, $/; __END__ '""

    Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

    Don't fool yourself.
Re: Mass regsub on High-bit chars.
by graff (Chancellor) on May 27, 2005 at 03:00 UTC
    So how many bytes is that? 19,000 records doesn't seem like such a big number, really (unless each record is like a whole megabyte or something).

    If it's a static set of data and you just need a one-shot transform to replace non-ascii with ascii, it wouldn't hurt to do a little diagnosis up front to see what you need to cover:

    # concatenate all your records together into one data stream # and pipe it all through this perl command line: perl -ne 'tr/\x00-\x7f//d; $ch{$_}++ for (split//); END{printf("%x %d\ +n",ord,$ch{$_}) for (sort keys %ch)}' # this prints a histogram of non-ascii byte values
    Sometimes this sort of diagnosis can reveal some unexpected properties (e.g. mistakes) in the data, especially for stuff that has been manually created in (and extracted from) proprietary file formats.

    Example: if 0x93 and 0x94 are supposed to open and close double-quotes, do you get the same quantity of each? If not, maybe some of them mean something else, or maybe some records just happen to have unbalanced quotes (and then you need to decide or be told whether that matters...)

Re: Mass regsub on High-bit chars.
by Anonymous Monk on May 27, 2005 at 14:47 UTC
    MS office products (...) I'm running against characters like \x92-\x94, \x80, etc.
    You want the demoroniser. It's even Perl, heh.
Re: Mass regsub on High-bit chars.
by ambrus (Abbot) on May 27, 2005 at 16:52 UTC

    Firstly, just try to export the files in text format in such a way that you can tell the character set to office. IF that fails, just export as text in whatever Windows-specific character set it likes, and piconv that text to a "normal" character set, such as iso_8859_1, iso_8859_2, utf8, utf16, or whatever you like.

    If, however, you really want a quick-and-dirty solution, and convert to ascii, here's some substitutions. This is assuming that your incoming data is cp1250. Also I omit those characters that are the same in 8859_1 and cp1250, as I think you're not angry with those. So, the only characters here are those that are different in 8859_2 and 8859_1, and the windows extensions.

    s/\x80/EUR/g; s/\x82/,/g; s/\x84/,,/g; s/\x85/.../g; s/\x86/\/\/\-/g; s/\x87/\/\/\=/g; s/\x89/\%0/g; s/\x8a/S\</g; s/\x8b/\</g; s/\x8c/S\'/g; s/\x8d/T\</g; s/\x8e/Z\</g; s/\x8f/Z\'/g; s/\x91/`/g; s/\x92/'/g; s/\x93/``/g; s/\x94/''/g; s/\x95/o/g; s/\x96/--/g; s/\x97/---/g; s/\x99/TM/g; s/\x9a/s\</g; s/\x9b/>/g; s/\x9c/s\'/g; s/\x9d/t\</g; s/\x9e/z\</g; s/\x9f/z\'/g; s/\xa1/\'\</g; s/\xa2/\'\(/g;

    Update: if you want better ascii equivalents, you might be able to generate them from the files in the Unicode directory of links (the browser).

    Update: readmored some of the code

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://460755]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (5)
As of 2024-03-28 20:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found