skinnymofo has asked for the wisdom of the Perl Monks concerning the following question:

I've got a mangled file (yes, it's my fault) that has at least 2 junky Windows control characters inserted at the beginning of every line. Below, you can see my attempt at picking the control characters off, but I'm obviously doing something wrong because it's not working. So, Monks, I beg you, hit me with your best advice.
#!/usr/bin/perl -w
use strict;

my ($infile, $outfile, $line);
$infile  = "c:/perldev/restore.txt";
$outfile = "c:/perldev/fixed_restore.txt";

open (INFILE, $infile) || die "Can't open $infile! $!";
open (OUTFILE, ">$outfile") || die "Can't open $outfile! $!";

while (<INFILE>) {
    $line = $_;
    $line =~ s/^\D{0,2}|\s{0,2}//;
    print OUTFILE $line;
}

close (INFILE);
close (OUTFILE);

Re: cleaning up control characters
by blackmateria (Chaplain) on Oct 27, 2001 at 00:49 UTC
    Looks like you've almost got it right. I think the problem is this regexp: $line =~ s/^\D{0,2}|\s{0,2}//;

    I'm not sure what you're trying to do there, especially with the \s trimming (are some of these junk characters spaces?). Assuming you actually want to purge control characters (i.e. ASCII range 0-31 & friends) and spaces, use the POSIX [:cntrl:] character class, like this (see perlre for more information): $line =~ s/^([[:cntrl:]]|\s){2,}//;

    This should delete all control characters and spaces from the beginning of any lines that start with two or more of them. (Unfortunately it will also strip lines with just leading spaces and no control characters, e.g. indented lines -- without seeing the data I don't know if this matters to you.) But why not just forget the {2,} and eliminate any leading control characters? $line =~ s/^([[:cntrl:]]|\s)+//;
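    A minimal sketch of the "+" version above, run against made-up sample lines (the log text and the helper name strip_leading_junk are assumptions for illustration, not taken from the original data):

```perl
use strict;
use warnings;

# Hypothetical helper: remove any leading run of control characters
# and whitespace from a single line.
sub strip_leading_junk {
    my ($line) = @_;
    $line =~ s/^([[:cntrl:]]|\s)+//;
    return $line;
}

# Invented sample data: junk control bytes in front of ordinary log text.
my @lines = (
    "\x01\x02ERROR disk full\n",
    "\x01\x02\x03WARN retrying\n",
    "INFO ok\n",
);

print strip_leading_junk($_) for @lines;
```

    Note that "\n" is both whitespace and a control character, so a line holding nothing but junk and its newline comes back empty, newline included.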

    If you want to keep leading spaces unless they're also mixed in with control characters: $line =~ s/^([[:cntrl:]]|\s)+// if ($line =~ /^([[:cntrl:]]|\s)+/ && $1 =~ /[[:cntrl:]]/);

    I'm not sure if that "clever" trick with the "$1=~" is legit (it syntax checks OK at least); maybe some other monk could clarify this. Unfortunately I don't know what your data looks like, so I can't really test these too well. Hope this helps though.
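    On the "$1 =~" question: a quantified capture group such as (...)+ keeps only the text of its last repetition, so in the statement above $1 holds a single character (the last one of the leading run), not the whole run. A possible variant, sketched here with a hypothetical helper name and invented sample lines, captures the entire run by wrapping a non-capturing group:

```perl
use strict;
use warnings;

# Hypothetical helper: strip the leading run of control characters and
# whitespace only when that run contains at least one control character.
sub strip_if_control {
    my ($line) = @_;
    # (?:...) groups without capturing; the outer parentheses capture the whole run.
    if ($line =~ /^((?:[[:cntrl:]]|\s)+)/) {
        my $run = $1;
        # Four-argument substr replaces the matched prefix with nothing.
        substr($line, 0, length($run), '') if $run =~ /[[:cntrl:]]/;
    }
    return $line;
}

print strip_if_control("\x01  ERROR after junk\n");  # control char, then spaces: stripped
print strip_if_control("    indented line\n");       # spaces only: left alone
```

    Tab is itself a control character under [:cntrl:], so tab-indented lines would still be stripped by this check.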

      Hey blackmateria, thanks! Your suggestion for using the POSIX control class did the trick. FYI, the data is an application error log, so I'm not worried about indents or leading spaces.
Re: cleaning up control characters
by jeroenes (Priest) on Oct 27, 2001 at 00:51 UTC
    Take a look in perlre. \D is the non-digit class, but maybe that isn't what you want. You'd better look up which control chars you can encounter there and use a custom class. Something like:
    s/^[\001-\017\s]{0,2}//; # chars 1 to 15 (and whitespace)
    While we're at it, why not simplify it to a one-liner?
    perl -pe 's/^[\001-\017\s]{0,2}//' infile.txt >outfile.txt
    See perlrun for nifty options... the -i (in-place edit) switch is also very useful.
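    The paths in the question suggest Windows, where cmd.exe does not treat single quotes as quoting, so a script form of the in-place edit may be easier than the one-liner. A minimal sketch, assuming a hypothetical helper name and a ".bak" backup suffix; it sets $^I, the variable behind the -i switch:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite each named file in place, keeping a copy
# of the original under the same name plus ".bak".
sub strip_in_place {
    my @files = @_;
    # Setting $^I turns on in-place editing for the <> operator.
    local ($^I, @ARGV) = ('.bak', @files);
    while (my $line = <>) {
        $line =~ s/^[\001-\017\s]{0,2}//;
        print $line;    # goes to the file being rewritten, not the screen
    }
    return;
}

# Example call (the path is the one from the question):
# strip_in_place("c:/perldev/restore.txt");
```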


    "We are not alone"(FZ)

    Update: Listen to blackmateria. Of course you should use the [:cntrl:] POSIX class; it's much more convenient.

Re: cleaning up control characters
by monkfish (Pilgrim) on Oct 27, 2001 at 00:52 UTC
    I just changed your regular expression to match 0 to 2 control chars and replace them with nothing; otherwise I did not touch the code.
    [:cntrl:] is the key to matching any control chars.

    #!/usr/bin/perl -w
    use strict;

    my ($infile, $outfile, $line);
    $infile  = "c:/perldev/restore.txt";
    $outfile = "c:/perldev/fixed_restore.txt";

    open (INFILE, $infile) || die "Can't open $infile! $!";
    open (OUTFILE, ">$outfile") || die "Can't open $outfile! $!";

    while (<INFILE>) {
        $line = $_;
        # [:cntrl:] only works inside a bracketed class, hence the double brackets
        $line =~ s/^[[:cntrl:]]{0,2}//;
        print OUTFILE $line;
    }

    close (INFILE);
    close (OUTFILE);
    -monkfish (the fishy monk)

    Edit: fixed error with \C[

Re: cleaning up control characters
by mandog (Curate) on Oct 27, 2001 at 07:29 UTC