Translating non-printable ascii

by samurai (Monk)
on Oct 04, 2004 at 16:29 UTC ( [id://396278] )

samurai has asked for the wisdom of the Perl Monks concerning the following question:

Masters of the dromedary,

My current project is to extract data from a proprietary format to MySQL. I use the database vendor's tool to dump the files to normal ASCII, and then I process them.

I recently got my hands on the spec for the proprietary format. I now have the knowledge to "decode" the proprietary format without using their tool to dump the file to ASCII (we're talking 27-30 gig files here, so disk usage is a big concern with that method).

My question involves iterating over string contents. The "compression" algorithm is incredibly simplistic, but effective. It uses run-length encoding for blank spaces (a 0xFF byte followed by a byte whose value equals the run length), and turns consecutive digit pairs into non-printable ASCII values. For example...

  • "00" is turned into ASCII 0x80
  • "01" is turned into ASCII 0x81
  • ...
  • "98" is turned into ASCII 0xEB
  • "99" is turned into ASCII 0xEC

Now I can get to my question. Running speed is of the utmost importance here. I know that perl could never do this as fast as the proprietary C utility that I use to dump these 30 gig files. But if I can avoid creating temp files and read them natively in perl, I can avoid disk usage issues.

What is the most efficient way to translate those ASCII bytes in perl? Perl's smallest unit of data, IIRC, is the string; there's no separate character or byte type. I need to be able to translate, as per the table above, any ASCII 0x80 into "00" in place in the string, ASCII 0x81 into "01", and so on.

I guess I could do s///, but regexes would probably be ridiculously slow. Or I could use index once per replacement character, in combination with substr. But that would mean running index 99 times (or more, if a character occurs more than once) over four million records @_@

I got my start coding in perl, so I am used to dealing with data in strings, not arrays of bytes. If anyone can help point me in the right direction for coding this up in the most efficient way possible, I'd be very grateful.

--
perl: code of the samurai

Replies are listed 'Best First'.
Re: Translating non-printable ascii
by Roy Johnson (Monsignor) on Oct 04, 2004 at 17:01 UTC
    You don't have to run index 99 times to use s///; a single pass will do. Whether it's fast enough is a matter of trying it and seeing, but the expression should be:
    s/(\d\d)/chr(0x80+$1)/ge;
    (Actually, it's not clear what it should be if your translations are right: 0xEC is 108 more than 0x80, not 99.) But here's an example script:
    $_="00 01 98 99"; s/(\d\d)/chr(0x80 + $1)/ge; @f= map {sprintf "%x", $_} unpack("C*", $_); print "@f\n";
    Output:
    80 20 81 20 e2 20 e3

    Caution: Contents may have been coded under pressure.
      I was trying to save space by omitting the fact that "." is counted as a digit... so I'm missing the translation values for "0.", "1.", "2." etc etc. Sorry.

      I apologize, but I'm not sure I understand what your script is doing. I need to translate the other way: I need to turn ASCII 0x80 into "00", not the other way around. Or maybe I'm not understanding your answer. Here's a better example of what I'm trying to accomplish:

      I need to turn (ascii characters in curly braces):

      MO{ASCII 0x81}B{ASCII 0x8D}CAJ{ASCII 0xA3}

      into:

      MO01B12CAJ32

      Does that explain it better?

      --
      perl: code of the samurai

        Ok, here's how I would do it the other way: Make a lookup table of the translations (since it's not straightforward base conversion):
        $_="MO\x81B\x8dCAJ\xa3"; my $start = 0x80; my %xlate = map { my $first = $_; map {(chr($start++), "$first$_")} (0..9, '.') } (0..9,'.') ; s/([\x80-\xec])/$xlate{$1}/g; print;
        The compound map builds the translation table.

        Caution: Contents may have been coded under pressure.
Re: Translating non-printable ascii
by Fletch (Bishop) on Oct 04, 2004 at 17:19 UTC

    If you're really concerned about speed, the best thing to do would be to find a handy C programmer and give them your spec and have them write something you can wrap with Inline::C.
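
    For what it's worth, a minimal sketch of what that wrapper might look like. The function name here is made up, it assumes the base-11 mapping (digits 0-9 plus '.') worked out above, and it handles only the digit-pair bytes, not the 0xFF run-length sequences:

        use strict;
        use Inline C => <<'END_C';
        /* Expand each encoded byte (0x80-0xEC) into its two-character
           pair, using the "digit" set 0-9 plus '.'; every other byte is
           copied through unchanged. */
        SV *expand_pairs(SV *in) {
            static const char digs[] = "0123456789.";
            STRLEN len, i;
            unsigned char *src = (unsigned char *)SvPV(in, len);
            SV *out = newSVpvn("", 0);
            char buf[2];
            for (i = 0; i < len; i++) {
                if (src[i] >= 0x80 && src[i] <= 0xEC) {
                    int v = src[i] - 0x80;
                    buf[0] = digs[v / 11];
                    buf[1] = digs[v % 11];
                    sv_catpvn(out, buf, 2);
                } else {
                    sv_catpvn(out, (char *)(src + i), 1);
                }
            }
            return out;
        }
        END_C

        print expand_pairs("MO\x81B\x8dCAJ\xa3"), "\n";   # prints "MO01B12CAJ32"

    The C compiles once at first run; whether the per-record call overhead still beats a tuned s/// with a precomputed lookup table is something to benchmark on real data.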

Re: Translating non-printable ascii
by Thelonius (Priest) on Oct 04, 2004 at 17:41 UTC
    Here's one way to do it:
    #!perl -w
    use strict;

    my @ddig;
    my @spaces;
    $ddig[$_ + 128] = sprintf "%02d", $_ for 0 .. 99;   # "\x80" -> "00" ... "\xe3" -> "99"
    $spaces[$_] = " " x $_ for 0 .. 255;                # count byte -> run of spaces
    while (<DATA>) {
        s/([\x80-\xe3])/$ddig[ord($1)]/g;   # digit-pair bytes back to digits
        s/\xff(.)/$spaces[ord($1)]/eg;      # 0xFF + count byte back to spaces
        print;
    }
    __DATA__
    MOBCAJ
    This is atest
    123456789012345678901234567890
    Please note, as Roy Johnson says above, that either your specification is unclear or your arithmetic is wrong. I'm assuming that "\x80"-"\xe3" maps to "00" through "99". The example you give maps to "MO01B13CAJ35" with my code, not "MO01B12CAJ32".
Re: Translating non-printable ascii
by graff (Chancellor) on Oct 06, 2004 at 05:15 UTC
    At first, you said:
    The "compression" algorithm ... uses run-length encoding for blank spaces (0xFF byte followed by ASCII byte value equaling length), and turns consecutive digits into the non-printable ASCII values... What is the most efficient way to translate those ASCII bytes in perl?

    But then later you give this "example":

    I need to turn (ascii characters in curly braces): MO{ASCII 0x81}B{ASCII 0x8D}CAJ{ASCII 0xA3} into: MO01B12CAJ32

    The example doesn't show the 0xFF bytes that you say should precede the RLE count values (and if x81 is 01, then x8D should be 13 and xA3 should be 35), but I digress.

    If the byte sequence xFFxYY (where "YY" is a byte value between x80 and xFF) is supposed to represent a string of "blanks" (i.e. between 0 and 128 space characters), it sounds like the original (pre-RLE-compressed) data stream is just a fixed-width flat file, and the "xFFxYY" sequences are just field separators.

    So consider the following questions:

    1. Does the input data contain line breaks (LF or CRLF) to separate the rows?
    2. Regardless of that, do you know how many fields make up each row? (I presume you do, since you're importing the data into a mysql table.)
    3. If you were to add up the "uncompressed" number of characters in each row (this would be the sum of the lengths of the "printable" fields plus the sum of the "non-printable" RLE counts for spaces), would you always get the same total width for each row?
    I'm guessing that the answer to the third question is "yes", and that for each pair of "printable field value" and following "non-printable RLE count value", the total length of these two values will always be the same for a given field.
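
    One quick way to test that guess on a sample of the data: tally each row's "uncompressed" width and see whether a single value dominates. A rough sketch, assuming rows are newline-terminated, only the byte values described in this thread occur, and (following Thelonius above) the count byte is taken at face value:

        # tally the "uncompressed" width of each row; if the fixed-width
        # guess is right, every row lands in the same bucket
        my %widths;
        while (<INPUT>) {
            chomp;
            my $w = 0;
            while (/\G(?:\xff(.)|([\x80-\xec])|([^\x80-\xff]+))/gs) {
                $w += defined $1 ? ord($1)       # an RLE blank run
                    : defined $2 ? 2             # a digit-pair byte -> two chars
                    :              length $3;    # printable text, as-is
            }
            $widths{$w}++;
        }
        print "$_ chars: $widths{$_} rows\n" for sort { $a <=> $b } keys %widths;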

    That would mean that the RLE count is predictable from the number of characters in the preceding "printable" field. It also means that there is no need for you to retain the RLE counts. Just treat any sequence of two or more "non-printable" bytes as a field separator. Make sure that you can correctly determine the end of a "row", and push the field data into mysql; e.g. if the input data has "normal" line-breaks, you could handle it as follows:

    while (<INPUT>) {
        s/[\x80-\xff]+/\t/g;   # turn all field separators into tabs
        print OUTPUT;
    }
    (If there are no line-breaks or other explicit markers of row boundaries, it's a little trickier to do it in an optimal way, but it's still quite doable.)

    This assumes that the original data never contains a tab as part of a field value -- probably a safe assumption in fixed-width flat file data, but if tabs do appear as data, just use something else (maybe even a particular "non-printable" character like "\xB7" or "\xA0"). Having a single, consistent field-separator character makes it trivial to import the data into mysql. It also saves a fair number of bytes in the file that you use for loading into mysql.

    You could of course use DBI to pump the data directly into mysql, but if you'll be doing this sort of data transfer a lot, you'll want to test how long it takes using DBI and no temp file, as opposed to feeding a temp file to mysqlimport (i.e. using the mysql-native "LOAD DATA INFILE" mechanism). In general, the latter goes a lot faster than running "insert" statements in DBI; even with the time it takes to generate the temp file, you could still come out ahead. (For that matter, it looks like mysql 4.1 and later will support feeding "mysqlimport" via a pipe, but I haven't tried this.)
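
    For illustration, a minimal sketch of driving "LOAD DATA INFILE" from DBI; the database name, table name, and file path here are placeholders, not anything from this thread:

        use strict;
        use DBI;

        # hypothetical connection parameters and names, for illustration only
        my $dbh = DBI->connect("dbi:mysql:database=testdb", "user", "pass",
                               { RaiseError => 1 });
        $dbh->do(q{
            LOAD DATA LOCAL INFILE '/tmp/records.tsv'
            INTO TABLE records
            FIELDS TERMINATED BY '\t'
            LINES TERMINATED BY '\n'
        });
        $dbh->disconnect;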
