by RhetTbull (Curate)
on Jul 31, 2001 at 21:18 UTC ( #101236=note )

in reply to

Nice idea -- it's something I've had to do before. I would recommend two changes.

1. Instead of specifying the files on the command line, use the Unix "filter" paradigm: read the input (either from a named file or from STDIN) and write the result to STDOUT. That way the user could do something like (depending on their shell):
    bigfile > bigfile2
or, in order to process a bunch of files, something like:
    for file in *.mat; do echo $file; ./ $file > $file.2; done
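To illustrate the filter idea, here is a minimal sketch (not the original script). The names run_filter and process_line are hypothetical, and the per-line transformation (reversing the field order) is only a stand-in for whatever the real script does. The point is that the loop only sees filehandles, so the caller decides whether input comes from a file or from STDIN.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real work: reverse the fields on one line.
# Assumes $delimiter is a literal string (hence the \Q...\E quoting).
sub process_line {
    my ($line, $delimiter) = @_;
    chomp $line;
    my @fields = split /\Q$delimiter\E/, $line;
    return join($delimiter, reverse @fields) . "\n";
}

# Filter loop: read every line from $in, write the result to $out.
# Neither handle is opened here, so the same code serves files and pipes.
sub run_filter {
    my ($in, $out, $delimiter) = @_;
    while (defined(my $line = <$in>)) {
        print {$out} process_line($line, $delimiter);
    }
    return;
}

# In a command-line script the entry point would be something like:
#   run_filter(\*STDIN, \*STDOUT, "\t");
# which lets the shell handle redirection of input and output.
```

Because the handles are passed in, the same sub can also be exercised against in-memory strings, which makes it easy to test without touching the disk.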

2. Don't open and close the file so many times! Use seek instead. It's probably faster. For a file with many cols, you will open and close the file a lot -- that takes up time. I did a quick benchmark and on my system here are the results from reading a large file hundreds of times:

    Benchmark: timing 100 iterations of openclose, seek...
     openclose: 186 wallclock secs (161.45 usr + 19.77 sys = 181.22 CPU) @ 0.55/s (n=100)
          seek:  17 wallclock secs ( 16.08 usr +  1.02 sys =  17.10 CPU) @ 5.85/s (n=100)
Your program does a lot of other work besides opening and closing files (reading the data, pushing stuff onto big arrays), so the speedup won't be as dramatic as in the simple benchmark, but it will be faster. I've made a small change (3 lines) to your program to use seek instead of repeated open/close. On a file with 1000 columns, the modified code ran about 25% faster than yours (a significant improvement if the file is really big).
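For anyone who wants to repeat this kind of comparison, here is a rough sketch using the core Benchmark module. It is not the benchmark that produced the numbers above: the test file, its size, the pass count, and the sub names read_openclose and read_seek are all assumptions made for illustration, so the timings will differ.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Temp qw(tempfile);

# Build a throwaway input file (size chosen arbitrarily for this sketch).
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} join("\t", 1 .. 20), "\n" for 1 .. 1000;
close $tmp;

# Strategy 1: reopen the file for every pass. Returns total lines read.
sub read_openclose {
    my ($file, $passes) = @_;
    my $lines = 0;
    for my $pass (1 .. $passes) {
        open(my $fh, '<', $file) or die "Cannot open $file: $!";
        while (defined(my $line = <$fh>)) { $lines++ }
        close $fh;
    }
    return $lines;
}

# Strategy 2: open once, rewind with seek after each pass.
sub read_seek {
    my ($file, $passes) = @_;
    my $lines = 0;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    for my $pass (1 .. $passes) {
        while (defined(my $line = <$fh>)) { $lines++ }
        seek($fh, 0, 0) or die "Cannot rewind $file: $!";
    }
    close $fh;
    return $lines;
}

# Time both strategies; 20 passes per call is an arbitrary choice.
timethese(10, {
    openclose => sub { read_openclose($path, 20) },
    seek      => sub { read_seek($path, 20) },
});
```

How large the gap is will depend on the operating system, the filesystem, and how much of the file is already cached, so treat any single run as indicative only.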
Here's your sub bigfile_colstolines modified to use seek:
    sub bigfile_colstolines {
        my $infile = shift;
        my $outfile = shift;
        my $infilehandle = "<$infile"; # read-only
        open (INFILE, $infilehandle) or die ("File error.\a\n");
        my $outfilehandle = ">$outfile"; # write only
        open (OUTFILE, $outfilehandle) or die ("Output failure.\a\n");
        my $line = <INFILE>;
        my @testarray = split (/$delimiter/, $line);
        close (INFILE);
        open (INFILE, $infilehandle) or die ("File error.\a\n");
        for (my $counter=0; $counter <= $#testarray; $counter++){
            my @columnarray = undef();
            while (defined ($line = <INFILE>)){
                chomp ($line);
                my @linearray = split (/$delimiter/, $line);
                push (@columnarray, $linearray [$counter]);
            }
            shift (@columnarray); # removes unwanted characters
            my $newline = join $delimiter, (@columnarray);
            print OUTFILE ($newline, "\n");
            #rewind the file
            seek(INFILE,0,0);
        }
        close(INFILE);
        close(OUTFILE);
    }
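As a side note, the same seek-based technique can be written against filehandles that are passed in, which makes it easier to test. The sketch below is a hypothetical variant, not the code from the thread: the name transpose_with_seek is invented, the delimiter is treated as a literal string rather than a regex, and missing fields are written as empty strings.

```perl
use strict;
use warnings;

# Hypothetical variant: write each input column as one output line,
# making one pass over the input per column and rewinding with seek.
# $in must be a seekable handle (a regular file, not a pipe).
sub transpose_with_seek {
    my ($in, $out, $delimiter) = @_;

    # Use the first line only to count the columns.
    my $first = <$in>;
    return unless defined $first;
    chomp $first;
    my @fields = split /\Q$delimiter\E/, $first, -1;

    for my $col (0 .. $#fields) {
        # Rewind instead of closing and reopening the file.
        seek($in, 0, 0) or die "Cannot rewind input: $!";
        my @column;
        while (defined(my $line = <$in>)) {
            chomp $line;
            my @row = split /\Q$delimiter\E/, $line, -1;
            push @column, defined $row[$col] ? $row[$col] : '';
        }
        print {$out} join($delimiter, @column), "\n";
    }
    return;
}

# Possible use with real files (the paths here are placeholders):
#   open(my $in,  '<', 'input.mat')  or die "Cannot read: $!";
#   open(my $out, '>', 'output.mat') or die "Cannot write: $!";
#   transpose_with_seek($in, $out, "\t");
```

One thing to keep in mind: seek fails on pipes, so this approach needs a real file as input and will not work on data piped in through STDIN.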

Re: Re:
by tfrayner (Curate) on Aug 02, 2001 at 03:35 UTC
    Thanks for the tips. I must admit, I hadn't appreciated the overhead involved with the repeated open/close operations. Turns out seek is my new best friend :-)

    The reason the STDIN/STDOUT Unix filter paradigm wasn't implemented in this script has more to do with its original development than with its final functionality. The first script was written on a Mac and I haven't found a good way to deal with STDOUT using MacPerl. Although it did make for some easily-implemented dialogue boxes :-)
