PerlMonks  

Re: Optimization of script

by GotToBTru (Prior)
on Aug 18, 2016 at 21:37 UTC [id://1170026]


in reply to Optimization of script

First of all, I think ww is exactly right in this post with the suggestion to use a database. With proper indexing, matching rows between tables will run very quickly indeed.
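To make the database idea concrete, here is a minimal, self-contained sketch using DBI with an in-memory SQLite database. The table and column names (control, file1, acct, rest) are my own assumptions for illustration, not anything from the original post; the sample rows mirror the test data shown below. The point is that one indexed join replaces all of the hand-rolled file scanning.

```perl
use strict;
use warnings;
use DBI;

# Sketch only: table/column names are illustrative assumptions.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE control (acct INTEGER, rest TEXT)');
$dbh->do('CREATE TABLE file1   (acct INTEGER, rest TEXT)');
$dbh->do('CREATE INDEX idx_file1_acct ON file1 (acct)');  # the index is what makes the join fast

# A few rows mirroring the sample data below
$dbh->do(q{INSERT INTO control VALUES (1,'control,record,1'), (2,'control,khrecord,2')});
$dbh->do(q{INSERT INTO file1   VALUES (1,'file,1,record,1'), (2,'file,1,record,2'), (3,'file,1,record,3')});

# One indexed join replaces the per-account file scans
my $rows = $dbh->selectall_arrayref(
    'SELECT f.acct, f.rest FROM control c JOIN file1 f ON f.acct = c.acct ORDER BY f.acct'
);
print scalar(@$rows), " matching rows\n";
$dbh->disconnect;
```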

I get the idea that the approach is to try to move through all three data files, keeping careful track of your progress so that you don't lose your place. You might find it useful to write a subroutine that handles searching through each file, keeping a pointer to the last place something was found.
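The pointer bookkeeping rests on two built-ins: tell gives the current byte offset in a filehandle, and seek jumps back to a saved offset. A tiny self-contained illustration (the file contents here are made up):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sorted file to demonstrate with
my ($fh, $tmpname) = tempfile();
print $fh "1,alpha\n2,beta\n3,gamma\n";
close $fh;

open my $ifh, '<', $tmpname or die "open: $!";
my $first = <$ifh>;        # consume "1,alpha"
my $pos   = tell($ifh);    # remember where we stopped
close $ifh;

# Later: reopen and pick up exactly where we left off
open $ifh, '<', $tmpname or die "open: $!";
seek $ifh, $pos, 0;        # 0 = SEEK_SET, absolute offset
my $next = <$ifh>;
print $next;               # prints "2,beta"
close $ifh;
```

Because the files are sorted by account, a search never needs to revisit anything before the saved offset.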

The following is crude and incomplete, but I hope it gives some idea of what I'm suggesting. The hash %output uses the input file names as keys, each holding a reference to an array of the lines that need to be written out to the corresponding output file. The hash %filepointers keeps track of the last spot in each input file where an account was found.

use strict;
use warnings;

open my $controlFileHandle, '<', 'control.csv' or die "control.csv: $!";
my ($file1, $file2, $file3) = qw/file1.csv file2.csv file3.csv/;
my (%output, %filepointers, $ofh, $infile, $line, $account, $accountline);

while ($accountline = <$controlFileHandle>) {
    $account = (split /,/, $accountline, 2)[0];
    lookForAccountInFile($account, $file1);
    lookForAccountInFile($account, $file2);
    lookForAccountInFile($account, $file3);
}

foreach $infile (keys %output) {
    open $ofh, '>', "new_$infile" or die "new_$infile: $!";
    foreach $line (@{$output{$infile}}) {
        print $ofh $line;
    }
    close $ofh;
}

sub lookForAccountInFile {
    my ($account, $file) = @_;
    open my $ifh, '<', $file or die "$file: $!";
    # resume from wherever the last search of this file left off
    if (defined $filepointers{$file}) {
        seek $ifh, $filepointers{$file}, 0;
    }
    my $found = 0;
    while (my $line = <$ifh>) {
        last if ($line eq "\n");
        my $la = (split /,/, $line, 2)[0];
        last if ($la > $account);           # files are sorted; we have gone past it
        if ($la == $account) {
            push @{$output{$file}}, $line;
            $found = 1;
        }
        $filepointers{$file} = tell($ifh);  # remember where we stopped
        last if ($found);
    }
}

Updated to compile and run. I will show my test files below. To answer specifically how this approach helps with optimizing: 1) using a subroutine means you need only fine-tune the code once, and then gain the benefit as many times as you use it; 2) it reduces considerably the number of arrays and other variables. I did not use the CSV module since it appeared you were only using the very first column in each record. If you will need to explore the records in more depth, you will absolutely want to incorporate that module. It handles all sorts of special cases that would trip up this approach. One of your requirements that I did not implement was the account limit per file. But if I did everything, you'd have no opportunity to learn! ;)
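For example, one of those special cases is a quoted field containing an embedded comma, which the plain split above would break apart. A short sketch with the CPAN module Text::CSV (the sample line here is invented):

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die Text::CSV->error_diag;

# A quoted field with an embedded comma: split /,/ would see five fields
my $line = qq{5,"record, with comma",file,5\n};
if ($csv->parse($line)) {
    my @fields = $csv->fields;
    print scalar(@fields), " fields; account ", $fields[0], "\n";
}
```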

Data:

control.csv:
1,control,record,1
2,control,khrecord,2
5,control,recordi,5
7,control,record,7

file1.csv:
1,file,1,record,1
2,file,1,record,2
3,file,1,record,3

file2.csv:
2,record,1,file,2
4,record,2,file,2
5,record,3,file,2
6,record,4,file,2
7,record,5,file,2

file3.csv:
4,file,3,record,4
5,file,3,record,5
But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)
