Re: cols2lines.pl

Nice idea -- it's something I've had to do before. I would recommend two changes.

1. Instead of specifying the files on the command line, use the Unix "filter" paradigm: read the input (either from a named file or from STDIN) and write the result to STDOUT. That way the user could do something like this (depending on their shell):

    cols2lines.pl bigfile > bigfile2

or, to process a bunch of files:

    for file in *.mat; do echo $file; ./cols2lines.pl $file > $file.2; done
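A minimal sketch of what a filter-style version might look like. The name cols2lines_filter is hypothetical and not part of the original script. It also assumes the whole table fits in memory, since piped input cannot be rewound, and it treats the delimiter as a literal string rather than a regex.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): read delimited rows
# from $in and write the transposed rows to $out. The whole table is held
# in memory, which suits piped input that cannot be rewound with seek.
sub cols2lines_filter {
    my ($in, $out, $delimiter) = @_;

    my @rows;
    while (defined(my $line = <$in>)) {
        chomp $line;
        # -1 keeps trailing empty fields; \Q..\E treats the delimiter literally.
        push @rows, [ split /\Q$delimiter\E/, $line, -1 ];
    }
    return unless @rows;

    # The first row decides how many columns are written out.
    for my $col (0 .. $#{ $rows[0] }) {
        my @cells = map { defined $_->[$col] ? $_->[$col] : '' } @rows;
        print {$out} join($delimiter, @cells), "\n";
    }
    return;
}
```

Calling it as cols2lines_filter(\*STDIN, \*STDOUT, "\t") would let the script sit in a pipeline or take redirected input, as in the shell examples above.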

2. Don't open and close the file so many times! Use seek instead; it's probably faster. For a file with many columns, you will open and close the file a lot, and that takes up time. I did a quick benchmark of reading a large file hundreds of times; here are the results on my system:

Benchmark: timing 100 iterations of openclose, seek...
openclose: 186 wallclock secs (161.45 usr + 19.77 sys = 181.22 CPU) @  0.55/s (n=100)
seek: 17 wallclock secs (16.08 usr +  1.02 sys = 17.10 CPU) @  5.85/s (n=100)
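The benchmark script itself isn't shown, so the following is only a sketch of how such a comparison could be set up with the standard Benchmark module. The sub names, the pass count, and the command-line handling are assumptions, not the code that produced the numbers above.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Read $file $passes times, reopening it for every pass.
sub read_with_reopen {
    my ($file, $passes) = @_;
    my $lines = 0;
    for my $pass (1 .. $passes) {
        open my $fh, '<', $file or die "Cannot open $file: $!\n";
        while (defined(my $line = <$fh>)) { $lines++ }
        close $fh;
    }
    return $lines;
}

# Read $file $passes times through one handle, rewinding with seek.
sub read_with_seek {
    my ($file, $passes) = @_;
    my $lines = 0;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    for my $pass (1 .. $passes) {
        seek($fh, 0, 0) or die "Cannot rewind $file: $!\n";
        while (defined(my $line = <$fh>)) { $lines++ }
    }
    close $fh;
    return $lines;
}

# Run the comparison only when a file name is given on the command line.
# The pass count (second argument, default 100) is an assumed parameter.
if (@ARGV) {
    my ($file, $passes) = ($ARGV[0], $ARGV[1] || 100);
    timethese(100, {
        openclose => sub { read_with_reopen($file, $passes) },
        seek      => sub { read_with_seek($file, $passes) },
    });
}
```

Both subs return the total number of lines read, which gives a quick check that the two strategies really do the same amount of work.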

Because your program also does a lot of other work (I/O, pushing stuff onto big arrays), not all of its time is spent opening and closing files, so the speedup won't be as dramatic as in the simple benchmark, but it will be faster. I've made a small change (3 lines) to your program to use seek instead of repeated open/close. On a file with 1000 columns, the modified code ran about 25% faster than yours (a significant improvement if the file is really big).

Here's your sub bigfile_colstolines modified to use seek:

sub bigfile_colstolines {
    my $infile = shift;
    my $outfile = shift;
    my $infilehandle = "<$infile";            # read-only
    open (INFILE, $infilehandle) or die ("File error.\a\n");
    my $outfilehandle = ">$outfile";          # write only
    open (OUTFILE, $outfilehandle) or die ("Output failure.\a\n");
    my $line = <INFILE>;                      # first line gives the column count
    my @testarray = split (/$delimiter/, $line);
    close (INFILE);
    open (INFILE, $infilehandle) or die ("File error.\a\n");

    for (my $counter=0; $counter <= $#testarray; $counter++){
        my @columnarray = undef();
        while (defined ($line = <INFILE>)){
            chomp ($line);
            my @linearray = split (/$delimiter/, $line);
            push (@columnarray, $linearray [$counter]);
        }
        shift (@columnarray);              # drops the undef element left by the initialisation above
        my $newline = join $delimiter, (@columnarray);
        print OUTFILE ($newline, "\n");
        # rewind the file for the next column
        seek(INFILE,0,0);
    }
    close(INFILE);
    close(OUTFILE);
}
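For comparison, here is a self-contained sketch of the same rewind-per-column idea using lexical filehandles and three-argument open. The name cols2lines_seek is hypothetical. Unlike the sub above, it takes the delimiter as a parameter and treats it as a literal string, which is an assumption about how the script's global $delimiter is meant to be used.

```perl
use strict;
use warnings;

# Hypothetical variant of the sub above: one pass over the input per
# column, rewinding with seek instead of reopening the file.
sub cols2lines_seek {
    my ($infile, $outfile, $delimiter) = @_;
    open my $in,  '<', $infile  or die "Cannot read $infile: $!\n";
    open my $out, '>', $outfile or die "Cannot write $outfile: $!\n";

    # The first line decides how many columns there are.
    my $first = <$in>;
    return unless defined $first;
    chomp $first;
    my @fields = split /\Q$delimiter\E/, $first, -1;

    for my $col (0 .. $#fields) {
        # Rewind to the start of the file before collecting this column.
        seek($in, 0, 0) or die "Cannot rewind $infile: $!\n";
        my @column;
        while (defined(my $line = <$in>)) {
            chomp $line;
            my @cells = split /\Q$delimiter\E/, $line, -1;
            push @column, defined $cells[$col] ? $cells[$col] : '';
        }
        print {$out} join($delimiter, @column), "\n";
    }

    close $in;
    close $out or die "Cannot finish writing $outfile: $!\n";
    return;
}
```

Starting with an empty @column removes the need for the shift after the loop, and short rows are padded with empty strings rather than undef.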


Re: Re: cols2lines.pl
by tfrayner (Curate) on Aug 02, 2001 at 03:35 UTC


The reason the STDIN/STDOUT Unix filter paradigm wasn't implemented in this script has more to do with its original development than with its final functionality. The first script was written on a Mac, and I haven't found a good way to deal with STDOUT using MacPerl. It did make for some easily-implemented dialogue boxes, though :-)
