Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Iteration speed

by seaver (Pilgrim)
on Jun 15, 2004 at 17:06 UTC ( #366950=perlquestion: print w/replies, xml ) Need Help??

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all

My problem is a bioinformatics problem, I'm currently running a script that is processing 13,000 files which contain co-ordinates for mulit-chain proteins

It calculates the interaction between each of the residues, between each chain (if there are any interactions).

Thus, it is iterating not only through each chain pair, but through every possible residue pair (though not residues in the same chain of course) and seeing if they are close enough, and determining the kind of interaction if so.

My problem, which I fear is unavoidable, as it is compounded by the fact that I cannot avoid to MISS any possible interactions, is that some of the larger files take hours, even more than a day, to process.

At this rate, it can take nearly a year, to go through all the files, which is unfortunate.

So take for example, a 6 chain protein, with approx 3000 residues, that's approx 500 residues per chain. So in one chain pair, there's 500x500 iterations, which is 250,000 iterations, and because there's 6 chains, that's 15 possible chain pairs (avoiding repeats eg: AB == BA) so thats .25 shy of 4 million iterations.

I just wanted to know, what are the potential bottlenecks? One such file (larger than the above example) is still being processed after 1.5 days!

The way my program runs, is that, while it reads in the file, for every new residue it reads in, it iterates through the list currently in memory (avoiding residues in the same chain of course) to look for new interactions, and at the same time, is populating a database with the atomic and residual details, and the interactions if any. Is this a stupid way of doing it?


Replies are listed 'Best First'.
Re: Iteration speed
by tachyon (Chancellor) on Jun 15, 2004 at 18:50 UTC

    4 million iteration is nothing much in the scheme of things. I did a little finacial brute force matching problem that did 3.5 million in under a minute with about 50 lines of munging in the loops.

    Gererally this sort of problem comes down to a few simple things. Disk are slow. Databases are slow. Memory, hashes, arrays (perhaps) are fast. Pull everything into memory, hash or array structures depending. Use hash tables to index the interactions. Process. Get results. Write to disk at the end. If you can't pull everyting into memory get more memory. Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side.

    If you can post some exapmle data/and or code that illustrates the problem I am sure it will be possible to make if much faster.



Re: Iteration speed
by davidj (Priest) on Jun 15, 2004 at 17:31 UTC
    There could be several things causing slow iteration time:

    1) data structures used (some are more efficient than others)
    2) the algorithm used for determining interations (is it the most efficient?).
    3)how the residuals are being parsed and compared
    4) how you are reading in the file (if you have a new file read for each residue it will bottleneck)
    5) how you are storing information in the database.

    These are just off the top of my head. No doubt there are others. Give us a better idea of what you are doing (especially addressing the above) and it will be easier to help you.

Re: Iteration speed
by dragonchild (Archbishop) on Jun 15, 2004 at 17:32 UTC
    At some point, it becomes more efficient to purchase another machine than it is to pay someone to improve the code. Additionally, the kind of processing you're talking about isn't going to happen very quickly, no matter what you do, on a PII-400.

    A very big help here would be to see some of your code. There are a number of improvements that can be made to any code. Some of these improvements might be to drop out of Perl and into C.

    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Iteration speed
by pbeckingham (Parson) on Jun 15, 2004 at 17:33 UTC

    On the whole, I think we need more information. But here are some vague questions:

    • You have taken on an N^2 problem. Most of the work is probably going to be spent interating and comparing. As such, threading is not going to buy you that much. Is this is a multi-CPU box?
    • Is the list in memory in a raw or processed state? That is, does it need parsing/evaluating/decoding every time, or is that done once? What I'm trying to ask is: is the list in memory in an access-efficient structure? A little processing time up-front may make yield speedier access.
    • Perhaps you can eliminate half of the comparisons by somehow marking the data once you have done AB to prevent BA, as you suggested.
    • Silly question maybe, but is there a faster machine with more memory around to use? Or more CPUs?
    • Once your algorithm is good, PerlMonks will be able to improve the implementation of it significantly.

      I should add that I am running this script on a Dell PowerEdge server, with 3GB RAM, and 2x2.4GHZ processors.


Re: Iteration speed
by TrekNoid (Pilgrim) on Jun 15, 2004 at 19:08 UTC
    You wrote that:

    The way my program runs, is that, while it reads in the file, for every new residue it reads in, it iterates through the list currently in memory (avoiding residues in the same chain of course) to look for new interactions

    A lowly novice takes a brave step in offering advice... Not as a Perl guru (haven't been writing it long enough to qualify for that lofty title... just as someone who works with clinical data a bit, and learned Perl while doing it...

    One thing that instantly popped into my mind was to consider using associative arrays, rather than having to 'iterate through a list in memory'.

    Since you have a large set of potential, non-sequential items to compare, you can save a great deal of time by not having to bother searching through a list, and just index the associative array by the residue types.

    Maybe set up the residue array, and just concatenate new residue interactions as you come across them, then write when you're done in one fell swoop?

    I can't offer anything else off the top of my head... code sample would definately help, though.


Re: Iteration speed
by tachyon (Chancellor) on Jun 15, 2004 at 19:47 UTC

    Here is some sample code to get you started

    my @aminos = qw( A R N D C E Q G H I L K M F P S T W Y V ); # now A-R is the same as R-A in terms of interactions but if we # only use one we need a sort so use both, duplicate data but lose sor +t.... my @a_a; for my $a1(@aminos) { for my $a2(@aminos) { push @a_a, ["${a1}_${a2}", (join '', sort $a1,$a2)]; } } # now prepare our lookup table for interactions print "my \%interactions = (\n"; for my $a_ref( sort{ $a->[1] cmp $b->[1] }@a_a ) { print "\t",$a_ref->[0], " => 0,\n"; } print ");\n"; # this will give you a hash like this (except bigger) # fill in the interaction values (note A-C and C-A are adjacent # for convenience of filling these in. this is a do once my %interactions = ( A_A => 0, A_C => 1, C_A => 1, A_D => 2, D_A => 2, E_A => 0, A_E => 0, ); # next read in the sequences. my @chains; for my $file(@files) { open F, $file; my @residues = <F>; chomp @residues; push @chains, \@residues; } # fake it @chains = ( [ qw( A C D ) ],[ qw( A C D ) ],[ qw( A C D ) ], [ qw( A C D ) ],[ qw( D D D ) ],[ qw( E E E ) ],); # we now have an array of arrays. top level is chain, next level resid +ue my $interact; for my $ca_i( 0.. $#chains-1 ) { for my $cb_i( $ca_i+1 .. $#chains ) { print "Comparing chain $ca_i to $cb_i\n"; for my $res_a( @{$chains[$ca_i]} ) { for my $res_b( @{$chains[$cb_i]} ) { $interact = $res_a . '_' .$res_b; print "\t$interact => $interactions{$interact}\n"; } } } }

    Using simulated data this algorithm does 3.75 million interaction mappings in under a minute. Including writing it all to a file.

Re: Iteration speed
by meetraz (Hermit) on Jun 15, 2004 at 17:24 UTC
    Can you post pseudo-code? It's hard to determine where the bottlenecks might be, without seeing the structure.

      Dear all,

      Though this is a reply to meetraz, it involves having read the next 10 or so replies too.

      I'm now posting an example of the file data that I use, and the pseudo-algorithm

      But before I do so, I want to say that probably the best reply of the lot came from 'toma' I will truly go away and study computational geomtry now, thanks toma!


      ATOM 1 N GLY A 1 43.202 54.570 86.432 1.00 44.15 + N ATOM 2 CA GLY A 1 44.109 54.807 85.249 1.00 45.94 + C ATOM 3 C GLY A 1 44.984 56.034 85.443 1.00 43.20 + C ATOM 4 O GLY A 1 45.070 56.527 86.578 1.00 46.54 + O ATOM 5 N SER A 2 45.617 56.538 84.378 1.00 38.14 + N ATOM 6 CA SER A 2 46.461 57.710 84.544 1.00 33.00 + C ATOM 7 C SER A 2 46.057 59.017 83.842 1.00 29.38 + C ATOM 8 O SER A 2 46.522 60.071 84.270 1.00 32.21 + O ATOM 9 CB SER A 2 47.972 57.377 84.387 1.00 29.10 + C ATOM 10 OG SER A 2 48.326 56.931 83.084 1.00 25.44 + O ATOM 11 N HIS A 3 45.172 59.000 82.838 1.00 24.08 + N ATOM 12 CA HIS A 3 44.778 60.271 82.173 1.00 20.59 + C ATOM 13 C HIS A 3 43.299 60.425 81.787 1.00 18.87 + C ATOM 14 O HIS A 3 42.664 59.463 81.375 1.00 16.14 + O ATOM 15 CB HIS A 3 45.597 60.514 80.897 1.00 17.13 + C ATOM 16 CG HIS A 3 47.060 60.648 81.145 1.00 20.02 + C ATOM 17 ND1 HIS A 3 47.596 61.677 81.887 1.00 20.09 + N ATOM 18 CD2 HIS A 3 48.099 59.855 80.797 1.00 21.10 + C ATOM 19 CE1 HIS A 3 48.904 61.516 81.989 1.00 19.86 + C ATOM 20 NE2 HIS A 3 49.233 60.417 81.333 1.00 18.86 + N
      The co-ordinates are indeed cartesian, the last three columns are meaningless.

      The other important columns are the second through to the sixth columns, in order:

      1. atom number
      2. atom name
      3. residue name
      4. chain id
      5. residue number

      I should add that each line is for an atom and not for a residue, though I was originally talking about residue iteration, I use residue iteration to try and avoid any extra atomic iteration, but in reality, the number of lines is about 10-20 times the number of residues themselves, depending on the residue type

      The pseudo-code, which I'l post below, will ignore the format listed above, because each line is for one atom, and I have a coded PDB::Atom object (home-made) that takes the line as it's read, and parses it.

      For the sake of making the psuedo-code concise, it is indeed 'PSEUDO' so please dont expect it to run at all!

      The function 'addToMemory' simply fills the different hashes I use for lookup, especially to recall all the atoms in a residue. It is in there also, that the atomic data is added to the database, so there is one DB call per line in the file, this is one bottleneck, but much less of a bottleneck than the sheer number of iterations themselves

      I try to cut down on the iterations, by pre-calculating the 3D center of the residue, and then comparing the distance of a residue pair to a hard-coded cut-off (varies depending on residues themselves) in 'notClose' (not shown). This avoids having to iterate through all the atoms in a residue, if there is no chance of a bond.

      Finally the bond detection itself is in another function, not shown, and doesn't necessarily return a bond, there is more calculations depending on the nature of the atom itself. I just wanted to show the nature of the iterations themselves.

      It should be noted that there is a large amount of processing for particular residues and atoms due to possible errors in the file, which I've excluded.

      open (FILE, "< ".$file ) or die "\ncouldn't open FILE: ".$file.": $! +"; my $i=0; while(<FILE>){ if($_ =~ /^ATOM/){ $temp = new PDB::Atom('-line' => $_, '-number'=>$i); addToMemory($self, $temp, $i); ########bond detection ################# ###iterate through chains### foreach my $ch (sort {$a <=> $b} keys %{$self->{'contacts'}){ next if $temp->chain eq $ch; #skip same chains ###iterate through residues### foreach my $r (sort {$a <=> $b} keys %{$self->{'contacts'}{$ch +}}){ next if notClose($temp,self->{'contacts'}{$ch}{$r}); foreach my $a (sort {$a <=> $b} keys %{$self->{'residues'} +{$ch.$r}}){ $bonds->newBond($temp,$self->{'all'}{$ +a}); } } } } } close(FILE); sub addToMemory{ my ($self, $atom, $numb) = @_; $self->{'all'}{$numb}=$atom; unless($atom->proton){ $self->{'atoms'}{$numb} = 1; }else{ $self->{'protons'}{$numb} = 1; } $self->{'contacts'}{$atom->chain}{$atom->resNumber}=residualAverag +e($atom, $self->{'contacts'}{$atom->chain}{$atom->resNumber}); $self->{'residues'}{$ch.$atom->resNumber}{$numb}=1; }

      Edited by Chady -- converted pre to code tags.

        Sorry, but your pseudo code is just a little to pseudo. Without a clear understanding of the internals of the PDB::Atom objects and it's methods, I find it impossible to understand what is going on.

        One casual comment though. In general, perl's objects carry a considerable performance penalty relative to it's standard hashes and array's. Iterating large volumes of data represented as objects will be much slower than processing that same data stored in a hash or an array.

        This is no surprise, nor a critisism of Perl. Just s statement of fact. The extra levels of indirection and lookup are bound to carry a penalty, but when dealing with large volumes where speed is a criteria, it is best avoided.

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

        dear all

        I've started a 'profile' on one of my biggest files (15 chains!) and here's the most telling results:

        %Time    Sec.     #calls   sec/call  F  name
        17.79 1043.2695  10727535   0.000097     PDB::Bonds::notClose
        14.88  872.6432  14509596   0.000060     PDB::Writer::numberFormat
        10.52  616.7291  14509596   0.000043     PDB::Bond::dist
         6.89  403.8085  14509597   0.000028     PDB::Writer::pad_left
         6.40  375.5808        1  375.580791  ?  HighPDB::parse
         6.39  374.5854  14509596   0.000026     PDB::Writer::pad_right
         5.54  325.1066  43586508   0.000007     UNIVERSAL::isa
         4.89  286.5572  1881489   0.000152     PDB::Bond::new
         3.49  204.5796  18291657   0.000011     PDB::Atom::x
         3.42  200.8238  18291657   0.000011     PDB::Atom::y
         3.23  189.6586  18291657   0.000010     PDB::Atom::z
         3.14  184.3266  1881489   0.000098     PDB::Bond::isHydphb
         3.14  184.2096  1880381   0.000098     PDB::Bond::isElcsta
         2.24  131.0808        1  131.080769     WhatIf::doWhatif
         1.91  111.7546  10730691   0.000010     PDB::Atom::resName
        The code for the first three subroutines are shown here:

        sub notClose{ my $self=shift; my ($a1,$a2,$m)=@_; my %hash = %$a2; my $cutoff = $lengths{$a1->resName} + 7 + $hash{'r'}; my $dist = PDB::Bond->dist($a1->x,$hash{'x'},$a1->y,$hash{'y'},$a1 +->z,$hash{'z'}); my $value = $dist<$cutoff ? 0 : 1; return $value; } sub dist{my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($x1, $x2, $y1, $y2, $z1, $z2)=@_; return numberFormat(sqrt ( ($x1 - $x2)**2 + ($y1 - $y2)**2 + ($z1 - $z2)**2 ), 1,2); } sub numberFormat{ my( $number, $whole, $frac ) = @_; return pad_left('0',$whole,'0').'.'.pad_right('0',$frac,'0')if $nu +mber == 0; return pad_left($number,$whole,'0') unless $number =~ /\./ || $fra +c; my ($left,$right); ($left,$right) = split /\./, $number; $left = pad_left($left, $whole, '0'); if(defined $right){ $right = pad_right( substr($right,0,$frac), $frac, '0' ); return "$left\.$right"; }else{ $right = pad_right( '0', $frac, '0'); return "$left\.$right"; } } sub pad_left { my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($item, $size, $padding) = @_; my $newItem = $item; $padding = ' ' unless defined $padding; while( length $newItem < $size ) { $newItem = "$padding$newItem"; } return $newItem; } sub pad_right { my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($item, $size, $padding) = @_; my $newItem = $item; $padding = ' ' unless defined $padding; while( length $newItem < $size ) { $newItem .= $padding; } return $newItem; }

        I had totally forgotten that I use numberFormat to manipulate the result of the sqrt function. (This is essential for the DB) I'm now going to move this to the DB part, so that it only gets called when adding 'real' bonds to the DB.

        I'm also going to remove the UNIVERSAL::isa calls, and just try to ASSUME $self whenever I can.

        Thanks for all the help, and I'm still investigating mr. Delauney.


Re: Iteration speed
by amw1 (Friar) on Jun 15, 2004 at 17:41 UTC
    also, what does the input data look like? A possible place to look for speedups is to see if there's anyway you can structure your data in memory that will make it faster to determine if there is a match. (trees come to mind)

    i.e. (I have no idea if this has anything to do with genetic data but the example below does :)

    say you have

    gat gaa gac gtc tga ctg
    try creating a structure that looks like
    $struct->{g}{a}{t} = 1; $struct->{g}{a}{a} = 1; $struct->{g}{a}{c} = 1; $struct->{g}{t}{c} = 1; $struct->{t}{g}{a} = 1; $struct->{c}{t}{g} = 1; you can then drop out pretty quickly as soon as there isn't a possible + completion.
    If this has nothing to do with your question ignore me. I don't know a thing about protine residue sequence-a-go-go so I'm taking a stab in the dark.
Re: Iteration speed
by toma (Vicar) on Jun 16, 2004 at 07:14 UTC
    If I understand your question, by 'coordinates' you mean 'Cartesian coordinates', perhaps in two or three dimensions. The problem is that you have N points and you want to find the points that are close enough together to have some sort of interaction.

    From your description, you are making (N**2-N)/2 comparisons to find the points that are 'close enough' to require further calculations.

    If I have properly interpreted your problem, then I have an algorithm suggestion. It turns out that there is a very clever algorithm that will reduce the number of comparisons required to a very small number, perhaps something like 3*N. The algorithm you need can be found by searching for the words 'Voronoi' and 'Delaunay Triangulation.' These algorithms are most often described on two dimensional data, but I have heard rumors of success with this approach in up to six dimensions. Certainly three dimensional implementations are possible.

    I do not know of any perl implementations of these algorithms. Perhaps someone else knows of one.

    The field of study is called 'Computational Geometry.' I would try to describe the algorithms here, but geometry without pictures isn't much fun!

    It should work perfectly the first time! - toma
Re: Iteration speed
by BrowserUk (Patriarch) on Jun 15, 2004 at 22:42 UTC

    Describing your problem in terms that only another biochemist will understand means that most of us here will only be able to guess at what your program needs to do.

    • What do co-ordinates for mulit-chain proteins look like?
    • What does an interaction between each of the residues look like and how do you calculate it?

    The best way to speed up iteration is to avoid iterating. Lookups are fast.

    If your dataset is too large to fit in memory forcing you to re-read files, then the first pass I would make is to avoid having to re-parse the files each time. A pre-processing step that parses your files into convenient data structures and then writes these to disk in packed or Storable binary format would probably speed up the loading considerably.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      Oh, there's a few of us around :)

      The problem, as noted by others, is that we can't see his code to make suggestions. Shrug. Can't help much there. He doesn't even say if he's using the Perl bioinformatics modules or if he's rolled his own.

      In any case though, this is a problem that is begging for a parallel processing solution. In general, I'd recommend he break up the dataset and run it on all the machines in the lab. I doubt that there are many algorythmic improvements that can beat adding another 5 CPUs to the task.

      I didn't believe in evil until I dated it.

        I know there are a few of you guys around, but the description left me (and a few others from the responses) completely cold :)

        Belatedly, I have begin to think that this problem is related to a previous thread. If that is the case, I think that an algorithmic approach similar to that I outlined at Re: Re: Re: Processing data with lot of math... could cut the processing times to a fraction of a brute force iteration. As I mentioned in that post, my crude testing showed that by limiting the comparisons to a fraction of the possibles using an intelligent search I can process 100,000 coordinates and find 19000 matching pairs in around 4 minutes without trying very hard to optimise.

        I agree that a distributed search would perform the same task more quickly but the additional complexity of setting up that kind of system is best avoided if it can be. And if this is the same problem, that is easily possible. My test code from the previous thread is under 60 lines including the benchmarking.

        What stops me offering that code here is 1) A clear decription of the problem. 2) Some real test data.

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re: Iteration speed
by rir (Vicar) on Jun 15, 2004 at 17:34 UTC
    Perhaps you can generate your residues and sort them so that you can use part of the work from the previous residue pair.

    Individual writes are probably not helping.

    I suggest you profile your program with a small data set and see where the problems really are.

Re: Iteration speed
by Anonymous Monk on Jun 16, 2004 at 15:43 UTC
    First I have to mention that you are working with sets. So you can treat each interaction agents (proteine and residue) as part of interaction set. So basicaly it goes down to 6*500 = 3000 agents for your task. doing mappping protein->ProteinSet and residue ->ResidueSet you can create hash h{memberOfProteinSet}{memberOfResidueSet} =\@ArrayOfInteractionResults[numberOfInteraction]->\%hashOfIteractionResults. If you want to count all iteractions.If you want to hold values between chain interactions again map it to set and you have a structure of 15*14*3000=630000 which still is smaller than 15*250000=3500000.

    Edited by Chady -- added code tags.

Re: Iteration speed
by dba (Monk) on Jun 15, 2004 at 18:47 UTC
    Sometimes, if we don't have a right tool, no matter how efficient we try to become it may not be possible.
    I would seriously consider a well developed RDBMS (like oracle). They have invested several thousand man years of research to optimize joins and sorts and may not be worth to reinvent the wheel.
    There are specific perl books written on this subject. (I have not read the book, so can't comment)

      A RDBMS is almost never the solution where the problem reeks of algorithm.



        The context in which I mentioned was that RDBMS companies have been refining join/sort algorithms for years and be more useful to just use it, rather than refine/redefine again.
Re: Iteration speed
by Jasper (Chaplain) on Jun 18, 2004 at 12:53 UTC
    but through every possible residue pair (though not residues in the same chain of course) and seeing if they are close enough

    Probably a stupid question, and not perl related, but why not look at interactions from the same chain? Surely they are as important as inter-chain interactions?
      It is true that intrA-chain interactions can be important, especially in determining the secondary structure of the chain (ie whether it becomes a helix, or a 'sheet' etc.) but this topic is very well covered and any new data would provide little extra information.


Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://366950]
Approved by Old_Gray_Bear
Front-paged by broquaint
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2022-05-20 20:12 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (76 votes). Check out past polls.