seaver has asked for the wisdom of the Perl Monks concerning the following question:
My problem is a bioinformatics one: I'm currently running a script that processes 13,000 files containing co-ordinates for multi-chain proteins.
It calculates the interactions between the residues of each chain pair (if there are any interactions).
Thus, it iterates not only through each chain pair, but through every possible residue pair (though not residues in the same chain, of course), checking whether they are close enough and, if so, determining the kind of interaction.
My problem, which I fear is unavoidable, compounded by the fact that I cannot afford to MISS any possible interactions, is that some of the larger files take hours, even more than a day, to process.
At this rate, it can take nearly a year to go through all the files, which is unfortunate.
So take, for example, a 6-chain protein with approx 3,000 residues; that's approx 500 residues per chain. In one chain pair there are 500x500 comparisons, which is 250,000 iterations, and because there are 6 chains, there are 15 possible chain pairs (avoiding repeats, e.g. AB == BA), so that's 3.75 million iterations, just 0.25 million shy of 4 million.
I just wanted to know: what are the potential bottlenecks? One such file (larger than the above example) is still being processed after 1.5 days!
The way my program runs is that, while it reads in the file, for every new residue it reads in, it iterates through the list currently in memory (avoiding residues in the same chain, of course) to look for new interactions, and at the same time it populates a database with the atomic and residue details, and the interactions if any. Is this a stupid way of doing it?
Cheers
Sam
Replies are listed 'Best First'.
Re: Iteration speed
by tachyon (Chancellor) on Jun 15, 2004 at 18:50 UTC
4 million iterations is nothing much in the scheme of things. I did a little financial brute-force matching problem that did 3.5 million in under a minute, with about 50 lines of munging in the loops. Generally this sort of problem comes down to a few simple things. Disks are slow. Databases are slow. Memory, hashes, arrays (perhaps) are fast. Pull everything into memory, hash or array structures depending. Use hash tables to index the interactions. Process. Get results. Write to disk at the end. If you can't pull everything into memory, get more memory. Memory, processor, disk, in that order on the hardware side. Algorithm, algorithm, algorithm on the code side. If you can post some example data and/or code that illustrates the problem, I am sure it will be possible to make it much faster.

cheers

tachyon
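For illustration, a minimal sketch of that shape: everything in memory, one write at the end (the record layout and the cut-off are invented for the example):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative record layout: chain, residue id, x, y, z per line
    my %chains;                      # chain letter => list of residue records
    while (<DATA>) {
        my ( $chain, $res, $x, $y, $z ) = split;
        push @{ $chains{$chain} }, [ $res, $x, $y, $z ];
    }

    my $cutoff_sq = 5.0**2;          # squared cut-off: skip sqrt in the loop
    my @hits;
    my @ids = sort keys %chains;
    for my $i ( 0 .. $#ids ) {
        for my $j ( $i + 1 .. $#ids ) {   # each chain pair once (AB, not BA)
            for my $r1 ( @{ $chains{ $ids[$i] } } ) {
                for my $r2 ( @{ $chains{ $ids[$j] } } ) {
                    my $d2 = ( $r1->[1] - $r2->[1] )**2
                           + ( $r1->[2] - $r2->[2] )**2
                           + ( $r1->[3] - $r2->[3] )**2;
                    push @hits, "$ids[$i]:$r1->[0] $ids[$j]:$r2->[0]"
                        if $d2 <= $cutoff_sq;
                }
            }
        }
    }

    # One write at the end, not one per comparison
    open my $out, '>', 'interactions.txt' or die $!;
    print $out "$_\n" for @hits;
    close $out;

    __DATA__
    A 1 0.0 0.0 0.0
    A 2 9.9 0.0 0.0
    B 1 1.0 1.0 1.0
    B 2 50.0 50.0 50.0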
Re: Iteration speed
by davidj (Priest) on Jun 15, 2004 at 17:31 UTC
Some potential bottlenecks, off the top of my head:

1) the data structures used (some are more efficient than others)
2) the algorithm used for determining interactions (is it the most efficient?)
3) how the residues are being parsed and compared
4) how you are reading in the file (if you have a new file read for each residue, it will bottleneck)
5) how you are storing information in the database

No doubt there are others. Give us a better idea of what you are doing (especially addressing the above) and it will be easier to help you.

davidj
Re: Iteration speed
by dragonchild (Archbishop) on Jun 15, 2004 at 17:32 UTC
A very big help here would be to see some of your code. There are a number of improvements that can be made to any code. Some of these improvements might be to drop out of Perl and into C. ------
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
Re: Iteration speed
by pbeckingham (Parson) on Jun 15, 2004 at 17:33 UTC
On the whole, I think we need more information. But here are some vague questions:
by seaver (Pilgrim) on Jun 16, 2004 at 14:39 UTC
Cheers
Re: Iteration speed
by TrekNoid (Pilgrim) on Jun 15, 2004 at 19:08 UTC
"The way my program runs, is that, while it reads in the file, for every new residue it reads in, it iterates through the list currently in memory (avoiding residues in the same chain of course) to look for new interactions"

A lowly novice takes a brave step in offering advice... not as a Perl guru (I haven't been writing it long enough to qualify for that lofty title), just as someone who works with clinical data a bit and learned Perl while doing it.

One thing that instantly popped into my mind was to consider using associative arrays, rather than having to 'iterate through a list in memory'. Since you have a large set of potential, non-sequential items to compare, you can save a great deal of time by not having to search through a list, and can just index the associative array by the residue types. Maybe set up the residue array, concatenate new residue interactions as you come across them, then write them when you're done in one fell swoop?

I can't offer anything else off the top of my head... a code sample would definitely help, though.

Trek
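For instance, a minimal sketch of the idea (the keys and interaction labels are hypothetical):

    use strict;
    use warnings;

    # Hypothetical: accumulate interactions in a hash keyed by residue pair,
    # instead of scanning a list for each new residue.
    my %interactions;

    sub record_interaction {
        my ( $res_a, $res_b, $kind ) = @_;
        # Canonical key so A-B and B-A land in the same slot
        my $key = join '|', sort $res_a, $res_b;
        $interactions{$key} .= "$kind;";    # concatenate as we go
    }

    record_interaction( 'A:ALA17', 'B:GLU42', 'hbond' );
    record_interaction( 'B:GLU42', 'A:ALA17', 'salt' );

    # One pass to write everything out at the end
    print "$_ => $interactions{$_}\n" for sort keys %interactions;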
Re: Iteration speed
by tachyon (Chancellor) on Jun 15, 2004 at 19:47 UTC
Here is some sample code to get you started
Using simulated data, this algorithm does 3.75 million interaction mappings in under a minute, including writing it all to a file.
Re: Iteration speed
by meetraz (Hermit) on Jun 15, 2004 at 17:24 UTC
by seaver (Pilgrim) on Jun 16, 2004 at 14:32 UTC
Though this is a reply to meetraz, it involves having read the next 10 or so replies too. I'm now posting an example of the file data that I use, and the pseudo-algorithm.

But before I do so, I want to say that probably the best reply of the lot came from 'toma'. I will truly go away and study computational geometry now, thanks toma!

File: the co-ordinates are indeed cartesian; the last three columns are meaningless. The other important columns are the second through the sixth, in order:
I should add that each line is for an atom and not for a residue. Though I was originally talking about residue iteration, I use residue iteration to try to avoid any extra atomic iteration; in reality, the number of lines is about 10-20 times the number of residues, depending on the residue type.

The pseudo-code, which I'll post below, will ignore the format listed above, because each line is for one atom, and I have a coded PDB::Atom object (home-made) that takes the line as it's read and parses it. For the sake of making the pseudo-code concise, it is indeed 'PSEUDO', so please don't expect it to run at all!

The function 'addToMemory' simply fills the different hashes I use for lookup, especially to recall all the atoms in a residue. It is in there also that the atomic data is added to the database, so there is one DB call per line in the file. This is one bottleneck, but much less of a bottleneck than the sheer number of iterations themselves.

I try to cut down on the iterations by pre-calculating the 3D center of each residue, and then comparing the distance of a residue pair to a hard-coded cut-off (which varies depending on the residues themselves) in 'notClose' (not shown). This avoids having to iterate through all the atoms in a residue if there is no chance of a bond.

Finally, the bond detection itself is in another function, not shown, and doesn't necessarily return a bond; there are more calculations depending on the nature of the atom itself. I just wanted to show the nature of the iterations themselves. It should be noted that there is a large amount of processing for particular residues and atoms, due to possible errors in the file, which I've excluded.
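A rough sketch of the loop structure just described; PDB::Atom, addToMemory, notClose and the bond detection stand in for the real code (not shown), so this is still pseudo-code in shape:

    # Pseudo-code: the helper subs and PDB::Atom stand in for code not shown
    while ( my $line = <$fh> ) {
        my $atom = PDB::Atom->new($line);   # parse one atom record
        addToMemory($atom);                 # fill lookup hashes + one DB call

        my $res = $atom->residue;           # residue this atom belongs to
        for my $chain ( keys %residues_by_chain ) {
            next if $chain eq $atom->chain; # never pair within one chain
            RESIDUE:
            for my $other ( @{ $residues_by_chain{$chain} } ) {
                # cheap test first: residue centers vs. hard-coded cut-off
                next RESIDUE if notClose( $res, $other );
                # only now iterate the atoms of both residues
                for my $a1 ( @{ $atoms_by_residue{$res} } ) {
                    for my $a2 ( @{ $atoms_by_residue{$other} } ) {
                        my $bond = detectBond( $a1, $a2 );
                        addBondToDB($bond) if $bond;
                    }
                }
            }
        }
    }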
Edited by Chady -- converted pre to code tags.
by BrowserUk (Patriarch) on Jun 16, 2004 at 18:32 UTC
Sorry, but your pseudo-code is just a little too pseudo. Without a clear understanding of the internals of the PDB::Atom objects and their methods, I find it impossible to understand what is going on.

One casual comment, though. In general, Perl's objects carry a considerable performance penalty relative to its standard hashes and arrays. Iterating large volumes of data represented as objects will be much slower than processing that same data stored in a hash or an array. This is no surprise, nor a criticism of Perl, just a statement of fact. The extra levels of indirection and lookup are bound to carry a penalty, but when dealing with large volumes where speed is a criterion, they are best avoided.
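A minimal sketch of how to measure that penalty with the standard Benchmark module (the class and values are invented for the example):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    package Point;
    sub new { my ( $class, %args ) = @_; bless {%args}, $class }
    sub x { $_[0]{x} }    # typical read-only accessor

    package main;

    my $obj  = Point->new( x => 1.5 );
    my %hash = ( x => 1.5 );

    # Compare method-call access against direct hash access
    cmpthese( -1, {
        method => sub { my $v; $v = $obj->x  for 1 .. 1000 },
        hash   => sub { my $v; $v = $hash{x} for 1 .. 1000 },
    });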
by seaver (Pilgrim) on Jun 16, 2004 at 19:35 UTC
Dear all, I've started a profile on one of my biggest files (15 chains!) and here are the most telling results:
    %Time       Sec.    #calls    sec/call  F name
    17.79  1043.2695  10727535    0.000097  PDB::Bonds::notClose
    14.88   872.6432  14509596    0.000060  PDB::Writer::numberFormat
    10.52   616.7291  14509596    0.000043  PDB::Bond::dist
     6.89   403.8085  14509597    0.000028  PDB::Writer::pad_left
     6.40   375.5808         1  375.580791  ? HighPDB::parse
     6.39   374.5854  14509596    0.000026  PDB::Writer::pad_right
     5.54   325.1066  43586508    0.000007  UNIVERSAL::isa
     4.89   286.5572   1881489    0.000152  PDB::Bond::new
     3.49   204.5796  18291657    0.000011  PDB::Atom::x
     3.42   200.8238  18291657    0.000011  PDB::Atom::y
     3.23   189.6586  18291657    0.000010  PDB::Atom::z
     3.14   184.3266   1881489    0.000098  PDB::Bond::isHydphb
     3.14   184.2096   1880381    0.000098  PDB::Bond::isElcsta
     2.24   131.0808         1  131.080769  WhatIf::doWhatif
     1.91   111.7546  10730691    0.000010  PDB::Atom::resName

The code for the first three subroutines is shown here:
I had totally forgotten that I use numberFormat to manipulate the result of the sqrt function (this is essential for the DB). I'm now going to move this to the DB part, so that it only gets called when adding 'real' bonds to the DB. I'm also going to remove the UNIVERSAL::isa calls, and just try to assume $self whenever I can. Thanks for all the help, and I'm still investigating Mr. Delaunay.
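Relatedly, since dist is only ever compared against a cut-off, the sqrt can be skipped in the hot loop entirely by comparing squared distances; a minimal sketch (the cut-off value is illustrative):

    use strict;
    use warnings;

    # Compare squared distance against a squared cut-off: no sqrt, no
    # number formatting in the hot loop. Formatting happens only for hits.
    my $cutoff    = 6.5;              # illustrative cut-off
    my $cutoff_sq = $cutoff ** 2;

    sub close_enough {
        my ( $a1, $a2 ) = @_;         # each is [x, y, z]
        my $d2 = ( $a1->[0] - $a2->[0] )**2
               + ( $a1->[1] - $a2->[1] )**2
               + ( $a1->[2] - $a2->[2] )**2;
        return $d2 <= $cutoff_sq;
    }

    # sqrt and formatting only when we already know it's a real bond
    sub formatted_dist {
        my ( $a1, $a2 ) = @_;
        my $d = sqrt( ( $a1->[0] - $a2->[0] )**2
                    + ( $a1->[1] - $a2->[1] )**2
                    + ( $a1->[2] - $a2->[2] )**2 );
        return sprintf '%8.3f', $d;
    }

    my ( $p, $q ) = ( [ 0, 0, 0 ], [ 1, 2, 2 ] );
    print formatted_dist( $p, $q ), "\n" if close_enough( $p, $q );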
Cheers
by BrowserUk (Patriarch) on Jun 16, 2004 at 20:35 UTC
Re: Iteration speed
by amw1 (Friar) on Jun 15, 2004 at 17:41 UTC
i.e. (I have no idea if this has anything to do with genetic data, but the example below does :) say you try creating a structure that looks like the one sketched below. If this has nothing to do with your question, ignore me. I don't know a thing about protein residue sequence-a-go-go, so I'm taking a stab in the dark.
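A hypothetical structure of that kind (all keys invented for illustration):

    use strict;
    use warnings;
    use Data::Dumper;

    # Hypothetical nested structure: chain => residue number => residue data.
    # Lookup by key replaces scanning a flat list.
    my %protein = (
        A => {
            1 => { type => 'ALA', atoms => [] },
            2 => { type => 'GLU', atoms => [] },
        },
        B => {
            1 => { type => 'LYS', atoms => [] },
        },
    );

    # Direct access, no iteration needed:
    print $protein{A}{2}{type}, "\n";    # GLU
    print Dumper( $protein{B} );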
Re: Iteration speed
by toma (Vicar) on Jun 16, 2004 at 07:14 UTC
From your description, you are making (N**2-N)/2 comparisons to find the points that are 'close enough' to require further calculations. If I have properly interpreted your problem, then I have an algorithm suggestion.

It turns out that there is a very clever algorithm that will reduce the number of comparisons required to a very small number, perhaps something like 3*N. The algorithm you need can be found by searching for the words 'Voronoi' and 'Delaunay Triangulation'. These algorithms are most often described for two-dimensional data, but I have heard rumors of success with this approach in up to six dimensions. Certainly three-dimensional implementations are possible.

I do not know of any Perl implementations of these algorithms. Perhaps someone else knows of one. The field of study is called 'Computational Geometry'. I would try to describe the algorithms here, but geometry without pictures isn't much fun!
It should work perfectly the first time! - toma
Re: Iteration speed
by BrowserUk (Patriarch) on Jun 15, 2004 at 22:42 UTC
Describing your problem in terms that only another biochemist will understand means that most of us here will only be able to guess at what your program needs to do.

The best way to speed up iteration is to avoid iterating. Lookups are fast. If your dataset is too large to fit in memory, forcing you to re-read files, then the first pass I would make is to avoid having to re-parse the files each time. A pre-processing step that parses your files into convenient data structures and then writes these to disk in packed or Storable binary format would probably speed up the loading considerably.
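A minimal sketch of such a pre-processing pass (file name and record layout are illustrative):

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    # One-off pre-processing: parse the text once into a structure...
    my %parsed;
    while (<DATA>) {
        my ( $chain, $res, $x, $y, $z ) = split;   # illustrative layout
        push @{ $parsed{$chain} }, [ $res, $x, $y, $z ];
    }
    store \%parsed, 'protein.stor';

    # ...then every later run loads the binary image, no re-parsing:
    my $data = retrieve('protein.stor');
    print scalar @{ $data->{A} }, " residues in chain A\n";

    __DATA__
    A 1 0.0 0.0 0.0
    A 2 9.9 0.0 0.0
    B 1 1.0 1.0 1.0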
by jepri (Parson) on Jun 16, 2004 at 13:05 UTC
The problem, as noted by others, is that we can't see his code to make suggestions. Shrug. Can't help much there. He doesn't even say if he's using the Perl bioinformatics modules or if he's rolled his own. In any case, though, this is a problem that is begging for a parallel processing solution. In general, I'd recommend he break up the dataset and run it on all the machines in the lab. I doubt that there are many algorithmic improvements that can beat adding another 5 CPUs to the task.
by BrowserUk (Patriarch) on Jun 16, 2004 at 13:23 UTC
I know there are a few of you guys around, but the description left me (and, judging from the responses, a few others) completely cold :)

Belatedly, I have begun to think that this problem is related to a previous thread. If that is the case, I think that an algorithmic approach similar to the one I outlined at Re: Re: Re: Processing data with lot of math... could cut the processing times to a fraction of a brute-force iteration. As I mentioned in that post, my crude testing showed that by limiting the comparisons to a fraction of the possibles using an intelligent search, I can process 100,000 coordinates and find 19,000 matching pairs in around 4 minutes without trying very hard to optimise.

I agree that a distributed search would perform the same task more quickly, but the additional complexity of setting up that kind of system is best avoided if it can be. And if this is the same problem, that is easily possible. My test code from the previous thread is under 60 lines, including the benchmarking. What stops me offering that code here is the lack of: 1) a clear description of the problem; 2) some real test data.
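One common form of such an intelligent search is spatial binning: hash each point into a cell the size of the cut-off and compare it only against points in its own and neighbouring cells. A generic sketch (not the code from the previous thread; sizes are illustrative):

    use strict;
    use warnings;

    my $cutoff    = 6.0;
    my $cutoff_sq = $cutoff**2;

    # Simulated points in a 100-unit box
    my @points = map { [ rand 100, rand 100, rand 100 ] } 1 .. 10_000;

    # Bin each point into a cell keyed by its integer cell coordinates
    my %grid;
    for my $i ( 0 .. $#points ) {
        my $key = join ',', map { int( $_ / $cutoff ) } @{ $points[$i] };
        push @{ $grid{$key} }, $i;
    }

    # Compare only against the 27 surrounding cells, not all points
    my $pairs = 0;
    for my $i ( 0 .. $#points ) {
        my @cell = map { int( $_ / $cutoff ) } @{ $points[$i] };
        for my $dx ( -1 .. 1 ) {
            for my $dy ( -1 .. 1 ) {
                for my $dz ( -1 .. 1 ) {
                    my $key = join ',',
                        $cell[0] + $dx, $cell[1] + $dy, $cell[2] + $dz;
                    for my $j ( @{ $grid{$key} || [] } ) {
                        next unless $j > $i;    # count each pair once
                        my $d2 = ( $points[$i][0] - $points[$j][0] )**2
                               + ( $points[$i][1] - $points[$j][1] )**2
                               + ( $points[$i][2] - $points[$j][2] )**2;
                        $pairs++ if $d2 <= $cutoff_sq;
                    }
                }
            }
        }
    }
    print "$pairs close pairs\n";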
Re: Iteration speed
by rir (Vicar) on Jun 15, 2004 at 17:34 UTC
Individual writes are probably not helping. I suggest you profile your program with a small data set and see where the problems really are.
Re: Iteration speed
by Anonymous Monk on Jun 16, 2004 at 15:43 UTC
Edited by Chady -- added code tags.
Re: Iteration speed
by dba (Monk) on Jun 15, 2004 at 18:47 UTC
I would seriously consider a well-developed RDBMS (like Oracle). They have invested several thousand man-years of research in optimizing joins and sorts, and it may not be worth reinventing the wheel. There are specific Perl books written on this subject. (I have not read the book, so can't comment.)
by tachyon (Chancellor) on Jun 15, 2004 at 18:59 UTC
An RDBMS is almost never the solution where the problem reeks of algorithm.

cheers

tachyon
by dba (Monk) on Jun 16, 2004 at 13:20 UTC
Re: Iteration speed
by Jasper (Chaplain) on Jun 18, 2004 at 12:53 UTC
Probably a stupid question, and not Perl-related, but why not look at interactions from within the same chain? Surely they are as important as inter-chain interactions?
by seaver (Pilgrim) on Jun 18, 2004 at 16:58 UTC
Cheers