tips on how to speed up a very slow perl script requested

by Angharad (Pilgrim)
on May 06, 2009 at 14:28 UTC

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have two data files that I'm working with right now. They are fairly large, but I'll just show snippets of both for the purposes of this post. File 1 is as follows:
Y 39 1jpeA
Y 40 1jpeA
L 41 1jpeA
Y 42 1jpeA
R 43 1jpeA
K 44 1jpeA
Q 45 1jpeA
L 72 1gtiA
G 73 1gtiA
R 74 1gtiA
S 75 1gtiA
L 76 1gtiA
etc ...
and then file 2
1jpe,0,CYS,A,109
1jpe,0,CYS,A,103
1jpe,0,TYR,A,42
1jpe,0,ASP,A,68
1jpe,0,TYR,A,71
1jpe,0,PHE,A,70
etc ...
For each item (referred to as, for example, '1jpeA' in file 1, and as '1jpe' plus the chain letter 'A' in file 2; there can be other items such as 1jpeB, 1jpeC, etc., so we need the chain letter as well as the '1jpe' code), I want to search for entries that are equivalent, that is, entries that share the same item code and number (the numbers of interest are in column 2 of file 1 and column 5 of file 2), and then print these matches to a different file. So, for the example above, the result is
1jpeA 42
Both files have an entry for 1jpeA sharing the number 42, but share no other entries, i.e.

Y 42 1jpeA        (from file 1)
1jpe,0,TYR,A,42   (from file 2)
I've written a Perl script to do this task. It works, but it's very slow, and I was wondering if anyone had any tips on how I might speed it up. Here's the code:
#!/usr/local/bin/perl
use strict;
use warnings;

my $aa;
my $num;
my $str;
my $hashlookup;
my $pdb;
my $csanum;
my $chain;

my $resfile = shift;
my $csafile = shift;

my %hash;

open(RESFILE, "$resfile") or die "unable to open $resfile: $!\n";
open(CSAFILE, "$csafile") or die "unable to open $csafile: $!\n";

#my @resarray = <RESFILE>;
#close(RESFILE);

my @csaarray = <CSAFILE>;
close(CSAFILE);

#for (my $i = 0; $i < @resarray; $i++)
while (<RESFILE>)
{
    #my @array = split /\s+/, $resarray[$i];
    my @array = split /\s+/, $_;

    # lets do a test print
    $aa  = $array[0];
    $num = $array[1];
    $str = $array[2];
    $str = substr($str, 0, 5);
    #print "test $str $num $aa\n";

    for (my $j = 0; $j < @csaarray; $j++)
    {
        my @fields = split /\,/, $csaarray[$j];
        $pdb    = $fields[0];
        $csanum = $fields[4];
        $chain  = $fields[3];
        #print "test2 $pdb $csanum $chain\n";

        my $pdbchain = "$pdb" . "$chain";

        if ("$str" eq "$pdbchain")
        {
            if (!$hash{$str}{$csanum})
            {
                $hash{$str}{$csanum}++;
                print "$str $csanum\n";
            }
        }
    }
}
Thanks in advance :)

Re: tips on how to speed up a very slow perl script requested
by moritz (Cardinal) on May 06, 2009 at 14:39 UTC
    If you use a hash lookup in the inner loop instead of iterating over all the elements of the second file, you can get a huge speed gain. See perldata for more information about hashes.
      Yes, I tried that but it wasn't very successful (seeing as my grasp of hashes is rather pathetic)
        Here is what you need to do:
        1. Declare a hash
        2. Read the first file line by line, populating the hash
        3. Iterate over the second file, and look up the values in the hash

        In your current program, you compare things to $pdbchain, so your hash keys have to be constructed in exactly the same way as $pdbchain.

        Give it a try, and when you have problems, come back with more specific questions.
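
        A minimal sketch of those three steps (untested, and assuming the two file names are passed on the command line as in your script; %seen and the filehandle names are just placeholders):

        use strict;
        use warnings;

        my $resfile = shift;
        my $csafile = shift;

        # step 1: declare a hash
        my %seen;

        # step 2: read the first file line by line, populating the hash
        open my $RES, '<', $resfile or die "unable to open $resfile: $!\n";
        while (<$RES>) {
            my ( $aa, $num, $str ) = split;
            $seen{ substr( $str, 0, 5 ) }{ $num } = 1;   # outer key built like $pdbchain, e.g. '1jpeA'
        }
        close $RES;

        # step 3: iterate over the second file and look up the values in the hash
        open my $CSA, '<', $csafile or die "unable to open $csafile: $!\n";
        while (<$CSA>) {
            chomp;
            my ( $pdb, $chain, $csanum ) = ( split /,/ )[ 0, 3, 4 ];
            print "$pdb$chain $csanum\n" if $seen{"$pdb$chain"}{$csanum};
        }
        close $CSA;

        This way each line of the second file costs one hash lookup instead of a pass over the whole first file.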

        Please, when you have an instance like this, post your code. Doubtless someone here can find and correct the error. If you can't post it because it's long since been over-written, I'd suggest you look into a Version Control System. Sometimes good approaches get discarded because a detail or two gets in the way.

        HTH,

        planetscape
Re: tips on how to speed up a very slow perl script requested
by DStaal (Chaplain) on May 06, 2009 at 14:54 UTC

    I'm sure there will be a fair number of people giving specific code advice. I'll give programming advice:

    Once you know you have speed problems, profile. Don't assume you know what is fast and what is slow, and don't assume you know what is being run the most often.

    An adequate profiler, Devel::DProf, comes with Perl. Read up on it to learn the basics. It will help in many cases, but isn't all that extensive.

    A better profiler is available in Devel::NYTProf, which will tell you exactly how many times each line of code (or block, or subroutine) is called, and how long it took. Feed your program to that with a representative sample of your data, and work out ways to avoid the slowest operations.
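
    For example, assuming Devel::NYTProf is installed from CPAN ("slow_script.pl" and the two data-file arguments below are placeholders for your own script and files), a typical run looks something like:

        perl -d:NYTProf slow_script.pl resfile csafile   # writes profile data to ./nytprof.out
        nytprofhtml                                      # generates an HTML report under ./nytprof/

    Then open nytprof/index.html in a browser and look for the lines and subroutines where most of the time is spent.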

      Good general advice, but profiling the OP's code won't tell you anything that isn't readily apparent from a cursory inspection.

      Matching N elements from one file against M elements from a second by iterating through an array containing the M elements, and re-splitting each element every time through, makes it obvious enough where the problems lie.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      That's really good advice. Thanks. Definitely something to remember.
Re: tips on how to speed up a very slow perl script requested
by jpearl (Scribe) on May 06, 2009 at 16:00 UTC

    I'm not exactly a super expert or anything, but this script might get you where you want to go. Beware: I've only done minimal testing, and just on the data you provided. It also comes with the caveat (as has been previously mentioned) that it'll very much depend on the size of these files. However, since you are reading them into memory in your initial script, it's probably going to be all right.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $resfile = shift;
    my $csafile = shift;

    open(RESFILE, "$resfile") or die "unable to open $resfile: $!\n";
    open(CSAFILE, "$csafile") or die "unable to open $csafile: $!\n";
    open(OUTFILE, ">outfile") or die "Can't open the outfile: $!\n";

    my %REShash;

    # concatenate number and id for (hopefully) unique hash key
    while (<RESFILE>) {
        /([A-Z])\s(\d+)\s([0-9a-zA-Z]+)/;
        my $currentID = $3 . $2;
        $REShash{$currentID} = $1;
    }

    while (<CSAFILE>) {
        chomp;
        my @record = split(/,/, $_);

        # concatenate all parts of the record to get the same formatted hash key
        my $currentId = $record[0] . $record[3] . $record[4];

        if ( exists $REShash{$currentId} ) {
            print OUTFILE join(',', @record) . "\n";
        }
    }

    Also, FYI: what gets printed to "outfile" is in the format of what I believe you are calling the CSAFILE (I may have those two switched around).

Re: tips on how to speed up a very slow perl script requested
by jwkrahn (Abbot) on May 06, 2009 at 18:30 UTC

    It looks like you want something like:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $resfile = shift;
    my $csafile = shift;

    open my $RESFILE, '<', $resfile or die "unable to open $resfile: $!\n";

    my %hashlookup;
    while ( <$RESFILE> ) {
        my ( undef, $num, $str ) = split;
        $hashlookup{ substr( $str, 0, 5 ) . " $num" } = 1;
    }
    close $RESFILE;

    open my $CSAFILE, '<', $csafile or die "unable to open $csafile: $!\n";

    while ( <$CSAFILE> ) {
        chomp;
        my ( $pdb, $chain, $csanum ) = ( split /,/ )[ 0, 3, 4 ];
        my $key = "$pdb$chain $csanum";
        print "$key\n" if $hashlookup{ $key };
    }
    close $CSAFILE;
