http://qs321.pair.com?node_id=121693


in reply to cleaning up control characters

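Looks like you've almost got it right. I think the problem is this regexp: $line =~ s/^\D{0,2}|\s{0,2}//; The \D matches any non-digit, not just control characters, so it will happily eat the first two letters of an ordinary line. And because {0,2} is allowed to match nothing, the first alternative always succeeds at the start of the line, which means the \s{0,2} branch is never even tried.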
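To make that concrete, here is a small sketch of what the substitution from the question does to a few lines. The sub name original_strip and the sample strings are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted from the question.
sub original_strip {
    my ($line) = @_;
    $line =~ s/^\D{0,2}|\s{0,2}//;
    return $line;
}

# Made-up sample lines (the real data was never posted).
print original_strip("\x01\x02data"), "\n";   # prints "data"
print original_strip("abc123"), "\n";         # prints "c123": plain letters get eaten
print original_strip("123 456"), "\n";        # prints "123 456": empty match, nothing removed
```

The last case shows the empty match: \D{0,2} matches zero characters in front of the leading digit, so the substitution "succeeds" without changing anything.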

I'm not sure what you're trying to do there, especially with the \s trimming (are some of these junk characters spaces?). Assuming you actually want to purge control characters (i.e. ASCII 0-31 plus DEL, 127) and spaces, use the POSIX [:cntrl:] character class, like this (see perlre for more information): $line =~ s/^([[:cntrl:]]|\s){2,}//;
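If you want to see exactly which ASCII codes [[:cntrl:]] covers, a quick check along these lines should do it. This is only a sketch; the sub name ascii_cntrl_codes is my own invention.

```perl
use strict;
use warnings;

# Hypothetical helper: list the ASCII code points that [[:cntrl:]] matches.
sub ascii_cntrl_codes {
    return grep { chr($_) =~ /[[:cntrl:]]/ } 0 .. 127;
}

my @codes = ascii_cntrl_codes();
print scalar(@codes), " control characters\n";   # 33: codes 0-31 plus 127 (DEL)
print "first: $codes[0], last: $codes[-1]\n";    # first: 0, last: 127
```

Note that tab, newline and carriage return are in that range too, so they match both [[:cntrl:]] and \s.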

This should delete all control characters and spaces from the beginning of any line that starts with two or more of them. (Unfortunately it will also strip the indentation from lines with just leading spaces and no control characters -- without seeing the data I don't know if this matters to you.) But why not just forget the {2,} and eliminate any leading control characters and spaces, however many there are? $line =~ s/^([[:cntrl:]]|\s)+//;
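Here is a side-by-side sketch of the {2,} and + versions. The sub names and the sample lines are hypothetical, since I haven't seen the real data; the show helper just makes control characters visible when printing.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two substitutions above.
sub strip_two_or_more {
    my ($line) = @_;
    $line =~ s/^([[:cntrl:]]|\s){2,}//;
    return $line;
}

sub strip_any {
    my ($line) = @_;
    $line =~ s/^([[:cntrl:]]|\s)+//;
    return $line;
}

# Render control characters as \xNN so they show up in the output.
sub show {
    my ($text) = @_;
    $text =~ s/([[:cntrl:]])/sprintf('\\x%02x', ord($1))/ge;
    return $text;
}

# Made-up samples: two junk bytes, one junk byte, plain indentation.
for my $sample ("\x01\x02data", "\x01data", "  indented") {
    printf "%-12s {2,}: %-10s +: %s\n",
        show($sample), show(strip_two_or_more($sample)), show(strip_any($sample));
}
```

The only sample where the two differ is the single leading control character: {2,} leaves it in place, while + removes it. Both strip the indentation from the last sample.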

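If you want to keep leading spaces unless they're also mixed in with control characters: $line =~ s/^[[:cntrl:]\s]+// if ($line =~ /^([[:cntrl:]\s]+)/ && $1 =~ /[[:cntrl:]]/); Note that the capture has to wrap the whole run. With a quantified group like ([[:cntrl:]]|\s)+, $1 only holds the last repetition, so the test would look at just the final character of the run. Also bear in mind that tabs and newlines are themselves control characters, so a tab-indented line will still get stripped by this.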
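Here is a sketch of the "keep plain indentation" idea as a sub, capturing the whole leading run in one group. The name strip_if_control is hypothetical. I've also assumed that tabs should count as indentation to keep rather than as junk, so the test is /\S/ instead of /[[:cntrl:]]/; swap it back if tabs are junk in your data.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the leading run of control characters and
# whitespace only when it contains a non-whitespace control character.
sub strip_if_control {
    my ($line) = @_;
    # Capture the entire leading run in one group.
    if ($line =~ /^([[:cntrl:]\s]+)/) {
        my $run = $1;
        # The run holds only control characters and whitespace, so any \S
        # in it must be a control character that is not whitespace.
        $line =~ s/^[[:cntrl:]\s]+// if $run =~ /\S/;
    }
    return $line;
}

# Made-up samples, not from the original data.
print strip_if_control("\x01  data"), "\n";   # prints "data"
print strip_if_control("    data"), "\n";     # prints "    data" (indentation kept)
print strip_if_control("\tdata"), "\n";       # tab kept under the assumption above
```

Copying $1 into $run before doing the substitution avoids relying on $1 surviving a later match.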

As for that "clever" trick with the "$1=~": it is legit as far as it goes, since $1 is set as soon as the match on the left of the && succeeds, and the right-hand side only runs if it did. Unfortunately I don't know what your data looks like, so I can't really test these too well. Hope this helps though.