Cleaning Data Between Specified Columns

enoch has asked for the wisdom of the Perl Monks concerning the following question:

I was charged with writing a program that would "clean" data in a file (remove characters denoted bad, squash apostrophes). The program would accept a varaible number of arguments. The first would be a path to a file. The remaining arguments would be a series of column markers indicating between which columns to do the data cleaning. For example, if you wanted to clean the file data.txt but only between columns 0 to 30, 44 to 63, and 97 to 111, you would call the program like so:

% ./cleanse.pl data.txt 0-30 44-63 97-111
[download]

Here is how I implemented it (changing strings to character arrays, using splice, data munging appropriately, and putting the data back). I was wondering if any monks would have done it differently.

#!/usr/bin/perl
use warnings;
use strict;

##
##    This program accepts the name of a file from the command line
##    and a variable length of extra arguments specifying between whic
+h
##    columns to examine or which fields in a delimited file to examin
+e.
##    It, then, processes the file replacing any apostrophe with nothi
+ng
##    (that is, it squashes any appearance of an apostrophe turning "d
+on't"
##    into "dont" and "O'Connor" into "OConnor").  It, then, replaces
##    anything that is not a alpha-numeric, pipe, new line, or dash wi
+th a space
##

my $fileToCleanse = shift 
    or die "Usage $0 <fileName> <fromColumn - toColumn> " .
    "<fromColumn - toColumn>... where 'fileName' is the name " .
    "of the file to cleanse and the other parameters specify " .
    "the range of columns to cleanse";

open INPUT, $fileToCleanse . ORIG_DATA_FILE_EXT 
    or die "Could not open $fileToCleanse" .
             ORIG_DATA_FILE_EXT . " for reading because:\n$!\t\n";

my $fileContents = '';
my @columnSpanArray = ();

# build a two dimensional array
# to hold each one of the column index paramter pairs
my $index = 0;
foreach(@ARGV)
{
    ($columnSpanArray[$index][0], $columnSpanArray[$index][1]) = 
        split '-', $_;
    $index++;
}

while(my $line = <INPUT>)
{
    my @chars = split '', $line;    # split the line into an array of 
+chars
    
    foreach my $parameters (@columnSpanArray)
    {
        # if the end of the line occurs before the parameter
        # specified to cleanse to, only cleanse until end of line
        # for example, if we are to cleanse from 45 to 115
        # but the line is only 65 characters long, only cleanse up til
+ 65
        my $endOrLineLength = (length($line) > $$parameters[1]) 
                                ? $$parameters[1] 
                                : length($line);
        
        # go to next loop if the paramters exceed the line length
        next if $$parameters[0] >= $endOrLineLength;
        
        # take a slice of the array between the columns to examine
        my $tmpString = join '', @chars[$$parameters[0]..$endOrLineLen
+gth-1];
        
        $tmpString =~ s/(.)'(.)/$1$2/g;            # squash apostrophe
+s
        $tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c;    # remove bad characte
+rs
        
        # put it back into the array from which we got it
        splice(@chars, $$parameters[0], $endOrLineLength-$$parameters[
+0],
            split '', $tmpString);
    }
    
    # store the cleansed data as a string
    $fileContents .= join '', @chars;
} # end while INPUT
    
close INPUT;

# print back the cleansed data to the original file name
open OUTPUT, ">$fileToCleanse"  or 
    die "Could not open $fileToCleanse.cleansed for reading because:\n
+$!\t\n";
    
print OUTPUT $fileContents;
close OUTPUT;
[download]

Enoch

Comment on Cleaning Data Between Specified Columns Select or Download Code

Replies are listed 'Best First'.
Re: Cleaning Data Between Specified Columns by Fletch (Bishop) on Jan 27, 2003 at 19:51 UTC
You can use the fact that the return value from `substr()` is an lvalue and do something like: `{ local *tmp = \substr( $source, $start, $len ); $tmp =~ y/'//d; $tmp =~ y/a-zA-Z0-9\n\\|-/ /c; }` [download]	[reply] [d/l] [select]
Re^2: Cleaning Data Between Specified Columns by Aristotle (Chancellor) on Jan 28, 2003 at 01:51 UTC
Damn. That's great. `:)` And with a little change it's strict compliant and shorter too: `{ local _ = \substr $source, $start, $len; y/a-zA-Z0-9\n\\|-/ /c; y/'//d; }` [download] Update:* changed order of transliterations to account for the deletion of characters causing extra characters to shift in. Makeshifts last the longest.	[reply] [d/l]
Re: Cleaning Data Between Specified Columns by BrowserUk (Patriarch) on Jan 27, 2003 at 21:50 UTC
Using substr for this is kinda tricky. If one of the earlier ranges contains 's, and you delete them, that screws up the indexes for later ranges. One solution is to substitute a known-not-present char (I used \x7F) for ' whilst processing the ranges and then remove these from the resultant line. Update:A modified version to deal with replacing 's with spaces at the end of the field rather than deleting them entirely. Makes use of Fletch's neat trick and Aristotle's enhancement to it, now that is possible. `#! perl -sw use warnings; use strict; open( FILE, '<', shift) or die "Couldn't open $::FILE; $!"; @ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file'; while (my $line = <FILE>) { for ( @ARGV ) { next if $_->[0] > length $line; $_->[1] = length $line if $_->[1] > length $line; local _ = \substr($line, $_->[0], $_->[1]-$_->[0] + 1); tr[a-zA-Z0-9\n\\|\-'][ ]c; $_ .= ' ' x tr['][]d; } print $line; } close FILE;` [download] Original version Read more... (1069 Bytes) Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.*	[reply] [d/l] [select]
Re: Re: Cleaning Data Between Specified Columns by enoch (Chaplain) on Jan 27, 2003 at 22:17 UTC
++BrowserUK! I was bitten by my squashing of apostrophes. Because it was a fixed width file, squashing the apostrophes caused the width's to change. I changed the regex to: `s/(.)'(.\B*)/$1$2 /g` [download] So, that spacing was added for each apostrophe I pulled from any field. enoch	[reply] [d/l]
Re: Re: Re: Cleaning Data Between Specified Columns by BrowserUk (Patriarch) on Jan 27, 2003 at 22:26 UTC
In that case, there's no need for this line `$tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophes` [download] As the next line `$tmpString =~ tr/a-zA-Z0-9\n\\|\-/ /c; # remove bad characters` [download] will convert them to spaces anyway? Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply] [d/l] [select]
Re: Re: Re: Re: Cleaning Data Between Specified Columns by enoch (Chaplain) on Jan 28, 2003 at 02:31 UTC
Re^2: Cleaning Data Between Specified Columns by Aristotle (Chancellor) on Jan 28, 2003 at 01:54 UTC
Using "known nonexistant" characters is just asking for trouble.. it's a practice I've come to regard as a huge red flag. In this particular case and with Perl being Perl, the proper solution is surprising but very neat. Fletch++ Makeshifts last the longest.	[reply]
Re: Re^2: Cleaning Data Between Specified Columns by BrowserUk (Patriarch) on Jan 28, 2003 at 03:05 UTC
Sorry Aristotle. Fletch's (partial) solution, neat as the technique is, falls foul of the fact that deleting the apostrophies in a one range, causes all the subsequent columns to shift. Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply]
Re^4: Cleaning Data Between Specified Columns by Aristotle (Chancellor) on Jan 28, 2003 at 09:35 UTC
Re: Cleaning Data Between Specified Columns by Aragorn (Curate) on Jan 27, 2003 at 20:41 UTC
Taking one of the suggestions above, keeping your program flow intact, and mixing it a bit with my personal style, I come up with this: #!/usr/bin/perl use strict; use warnings; # Store the ranges in a list of hashes. my @Ranges = (); foreach my $arg (@ARGV) { my ($start, $end) = split(/-/, $arg); push @Ranges, { start => $start, end => $end }; } # Use standard input/output so we can pipe files through it. while (my $line = <STDIN>) { # We go through the columns in reverse so the offsets will be # correct. foreach my $range (reverse @Ranges) { my ($start, $end) = ($range->{start}, $range->{end}); my $line_len = length($line); $end = $line_len if $line_len < $end; next if $start > $end; # Skip out-of-bound ranges my $nchars = $end - $start; substr($line, $start-1, $end-$start) = cleanse(substr($line, $start-1, $end-$start)); } print $line; } # Remove apostrophes and other unwanted characters. sub cleanse { my $string = shift; $string =~ s/'//g; $string =~ tr/a-zA-Z0-9\n\\|\-/ /c; return $string; } [download] It could use some extra checking on the program arguments (ranges), though. Arjen	[reply] [d/l]
Re: Cleaning Data Between Specified Columns by Anonymous Monk on Jan 27, 2003 at 19:24 UTC
You could do something like the following: `substr($line, $pos1, $pos2) = cleanse(substr($line, $pos1, $pos2));` [download] You would have to work from the high indicies to the low ones in order to avoid messing up your offset counts.	[reply] [d/l]
Re: Re: Cleaning Data Between Specified Columns by tall_man (Parson) on Jan 27, 2003 at 19:29 UTC
The third parameter should be a length, not a position. `substr($line, $pos1, $pos2-$pos1+1) = cleanse(substr($line, $pos1, $po +s2-$pos1+1));` [download] Update:Here is some sample code, simplified in its input and output. The "local *tmp" trick by Fletch would also work, but I found it wouldn't pass strict without a "`use vars qw($tmp);`", so I went back to the simpler temporary variable idea. `use strict; my @pos = (0,5,15,20); while(my $line = <>) { my $i = $#pos; while ($i >= 1) { my $pos1 = $pos[$i - 1]; my $pos2 = $pos[$i]; my $len = $pos2 - $pos1 + 1; my $part = substr( $line, $pos1, $len ); $part =~ y/'//d; $part =~ y/a-zA-Z0-9\n\\|-/ /c; substr( $line, $pos1, $len ) = $part; $i -= 2; } # end while $i print $line; } # end while` [download]	[reply] [d/l] [select]


Do you know where your variables are?
	PerlMonks