tips on how to speed up a very slow perl script requested

by Angharad (Pilgrim)
on May 06, 2009 at 14:28 UTC

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have two data files that I'm working with right now. They are fairly large, but I'll just show snippets of both for the purposes of this post. File 1 is as follows:
Y 39 1jpeA
Y 40 1jpeA
L 41 1jpeA
Y 42 1jpeA
R 43 1jpeA
K 44 1jpeA
Q 45 1jpeA
L 72 1gtiA
G 73 1gtiA
R 74 1gtiA
S 75 1gtiA
L 76 1gtiA
etc ...
and then file 2
1jpe,0,CYS,A,109
1jpe,0,CYS,A,103
1jpe,0,TYR,A,42
1jpe,0,ASP,A,68
1jpe,0,TYR,A,71
1jpe,0,PHE,A,70
etc ...
For each item (referred to as, for example, '1jpeA' in file 1, and as '1jpe' plus the chain letter 'A' in file 2; there can be other items such as 1jpeB, 1jpeC, etc., so we need the chain letter as well as the '1jpe' code), I want to search for entries that are equivalent, that is, entries that share the same item code and number (the numbers of interest are in column 2 of file 1 and column 5 of file 2), and then print these matches to a different file. So, for the example above, the result is
1jpeA 42
Both files have an entry for 1jpeA sharing the number 42, but share no other entries, i.e.

Y 42 1jpeA        (from file 1)
1jpe,0,TYR,A,42   (from file 2)
I've written a Perl script to do this task. It works, but it's very slow, and I was wondering if anyone had any tips on how I might speed it up. Here's the code:
#!/usr/local/bin/perl
use strict;
use warnings;

my $aa;
my $num;
my $str;
my $hashlookup;
my $pdb;
my $csanum;
my $chain;

my $resfile = shift;
my $csafile = shift;

my %hash;

open(RESFILE, "$resfile") or die "unable to open $resfile: $!\n";
open(CSAFILE, "$csafile") or die "unable to open $csafile: $!\n";

#my @resarray = <RESFILE>;
#close(RESFILE);

my @csaarray = <CSAFILE>;
close(CSAFILE);

#for (my $i = 0; $i < @resarray; $i++)
while (<RESFILE>)
{
    #my @array = split /\s+/, $resarray[$i];
    my @array = split /\s+/, $_;

    # lets do a test print
    $aa  = $array[0];
    $num = $array[1];
    $str = $array[2];
    $str = substr($str, 0, 5);
    #print "test $str $num $aa\n";

    for (my $j = 0; $j < @csaarray; $j++)
    {
        my @fields = split /\,/, $csaarray[$j];
        $pdb    = $fields[0];
        $csanum = $fields[4];
        $chain  = $fields[3];
        #print "test2 $pdb $csanum $chain\n";

        my $pdbchain = "$pdb" . "$chain";

        if ("$str" eq "$pdbchain")
        {
            if (!$hash{$str}{$csanum})
            {
                $hash{$str}{$csanum}++;
                print "$str $csanum\n";
            }
        }
    }
}
Thanks in advance :)

Re: tips on how to speed up a very slow perl script requested
by moritz (Cardinal) on May 06, 2009 at 14:39 UTC
    If you use a hash lookup in the inner loop instead of iterating over all the elements of the second file, you can get a huge speed gain. See perldata for more information about hashes.
      Yes, I tried that but it wasn't very successful (seeing as my grasp of hashes is rather pathetic)
        Here is what you need to do:
        1. Declare a hash
        2. Read the first file line by line, populating the hash
        3. Iterate over the second file, and look up the values in the hash

        In your current program, you compare things to $pdbchain, so your hash keys have to be constructed in exactly the same way as $pdbchain.

        Give it a try, and when you have problems, come back with more specific questions.
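
        A minimal sketch of those three steps (untested, and assuming the two file names are passed on the command line as in your script; %seen and the filehandle names are just placeholders):

        use strict;
        use warnings;

        my $resfile = shift;
        my $csafile = shift;

        # step 1: declare a hash
        my %seen;

        # step 2: read the first file line by line, populating the hash
        open my $RES, '<', $resfile or die "unable to open $resfile: $!\n";
        while (<$RES>) {
            my ( $aa, $num, $str ) = split;
            $seen{ substr( $str, 0, 5 ) }{ $num } = 1;   # outer key built like $pdbchain, e.g. '1jpeA'
        }
        close $RES;

        # step 3: iterate over the second file and look up the values in the hash
        open my $CSA, '<', $csafile or die "unable to open $csafile: $!\n";
        while (<$CSA>) {
            chomp;
            my ( $pdb, $chain, $csanum ) = ( split /,/ )[ 0, 3, 4 ];
            print "$pdb$chain $csanum\n" if $seen{"$pdb$chain"}{$csanum};
        }
        close $CSA;

        This way each line of the second file costs one hash lookup instead of a pass over the whole first file.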

        Please, when you have an instance like this, post your code. Doubtless someone here can find and correct the error. If you can't post it because it's long since been over-written, I'd suggest you look into a Version Control System. Sometimes good approaches get discarded because a detail or two gets in the way.

        HTH,

        planetscape
Re: tips on how to speed up a very slow perl script requested
by DStaal (Chaplain) on May 06, 2009 at 14:54 UTC

    I'm sure there will be a fair number of people giving specific code advice. I'll give programming advice:

    Once you know you have speed problems, profile. Don't assume you know what is fast and what is slow, and don't assume you know what is being run the most often.

    An adequate profiler, Devel::DProf, comes with Perl. Read up on it to learn the basics. It will help in many cases, but isn't all that extensive.

    A better profiler is available in Devel::NYTProf, which will tell you exactly how many times each line of code (or block, or subroutine) is called, and how long it took. Feed your program to that with a representative sample of your data, and work out ways to avoid the slowest operations.
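
    For example, assuming Devel::NYTProf is installed from CPAN ("slow_script.pl" and the two data-file arguments below are placeholders for your own script and files), a typical run looks something like:

        perl -d:NYTProf slow_script.pl resfile csafile   # writes profile data to ./nytprof.out
        nytprofhtml                                      # generates an HTML report under ./nytprof/

    Then open nytprof/index.html in a browser and look for the lines and subroutines where most of the time is spent.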

      Good general advice, but profiling the OP's code won't tell you anything that isn't readily apparent from a cursory inspection.

      Matching N elements from one file against M elements from a second by iterating through an array containing the M elements, and re-splitting each element every time through, makes it obvious enough where the problems lie.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      That's really good advice. Thanks. Definitely something to remember.
Re: tips on how to speed up a very slow perl script requested
by jpearl (Scribe) on May 06, 2009 at 16:00 UTC

    I'm not exactly a super expert or anything, but this script might get you where you want to go. Beware: I've only done minimal testing, and just on the data you provided. It also comes with the caveat (as has been previously mentioned) that it'll very much depend on the size of these files. However, since you are reading them into memory in your initial script, it's probably going to be all right.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $resfile = shift;
    my $csafile = shift;

    open(RESFILE, "$resfile") or die "unable to open $resfile: $!\n";
    open(CSAFILE, "$csafile") or die "unable to open $csafile: $!\n";
    open(OUTFILE, ">outfile") or die "Can't open the outfile: $!\n";

    my %REShash;

    # concatenate number and id for (hopefully) unique hash key
    while (<RESFILE>) {
        /([A-Z])\s(\d+)\s([0-9a-zA-Z]+)/;
        my $currentID = $3 . $2;
        $REShash{$currentID} = $1;
    }

    while (<CSAFILE>) {
        chomp;
        my @record = split(/,/, $_);

        # concatenate all parts of the record to get the same formatted hash key
        my $currentId = $record[0] . $record[3] . $record[4];

        if ( exists $REShash{$currentId} ) {
            print OUTFILE join(',', @record) . "\n";
        }
    }

    Also, FYI: what gets printed to "outfile" is in the format of what I believe you are calling the CSAFILE (I may have those two switched around).

Re: tips on how to speed up a very slow perl script requested
by jwkrahn (Abbot) on May 06, 2009 at 18:30 UTC

    It looks like you want something like:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $resfile = shift;
    my $csafile = shift;

    open my $RESFILE, '<', $resfile or die "unable to open $resfile: $!\n";

    my %hashlookup;
    while ( <$RESFILE> ) {
        my ( undef, $num, $str ) = split;
        $hashlookup{ substr( $str, 0, 5 ) . " $num" } = 1;
    }
    close $RESFILE;

    open my $CSAFILE, '<', $csafile or die "unable to open $csafile: $!\n";

    while ( <$CSAFILE> ) {
        chomp;
        my ( $pdb, $chain, $csanum ) = ( split /,/ )[ 0, 3, 4 ];
        my $key = "$pdb$chain $csanum";
        print "$key\n" if $hashlookup{ $key };
    }
    close $CSAFILE;
