qhayaal has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I have thousands of files, each with thousands of entries. I need to process these files to do the following. Each line in any file has the Name, X, Y, Z of an atom. Read each file and find all the pairs of atoms that are within a cutoff distance. I could do it, but the problem was that it ran too slowly!
This was the sequence of steps I followed.
1. Read each file into an array directly.
2. Store X,Y,Z in three arrays.
3. With nested loops, find the pairs within cutoff distance.
What I would like to know is:
A. Is there a way of significantly improving the speed in Perl that involves this kind of math? If not,
B. If I have Fortran code (a subroutine) that reads three arrays and then gives me output, can I use the object file generated by that program to plug into Perl?
Thanks in advance for any tips.
Have a nice day.
Re: Processing data with lot of math...
by swngnmonk (Pilgrim) on May 12, 2004 at 14:48 UTC

Sounds to me like your first problem isn't that Perl is slow. Your first problem is that you have no idea whether your approach for processing all this data is efficient!
You're describing a pretty classic Graph Theory problem: taking a large set of points in space (3-dimensional space in this case) and finding the subset of points that are within a certain distance of each other. Go find a good book on graph theory (Introduction to Algorithms is a great one), and make sure your approach is sound.
Only then should you start worrying about whether Perl is fast enough.
 [reply] 
Re: Processing data with lot of math...
by BrowserUk (Patriarch) on May 12, 2004 at 17:07 UTC

I think that it is possible to do this with one loop, and almost no math!
However, I could be missing something, so read, digest, cogitate and draw your own conclusions.
For example, if you have an atom located at point ( 10, 10, 10 ), and the cutoff distance is 1 unit, then any atom that will be paired with this atom will have to be somewhere within a box defined by the points
p( 09, 09, 09 )
p( 09, 09, 11 )
p( 09, 11, 09 )
p( 09, 11, 11 )
p( 11, 09, 09 )
p( 11, 09, 11 )
p( 11, 11, 09 )
p( 11, 11, 11 )
So, instead of storing the x:y:z in separate arrays, or some other complex data structure, create a single array, indexed by the concatenation of x:y:z (with leading zeros).
$atoms[ '090909' ] = 'name1';
$atoms[ '090911' ] = 'name2';
$atoms[ '091109' ] = 'name3';
$atoms[ '091111' ] = 'name4';
$atoms[ '110909' ] = 'name5';
$atoms[ '110911' ] = 'name6';
$atoms[ '111109' ] = 'name7';
$atoms[ '111111' ] = 'name8';
Once you have created this (sparse) array, it becomes very easy to pick out the limited set of possible pairings for any given point. It requires a final math check to exclude those points lying in the corners of the box, but the number of calculations should be significantly less than the brute-force method.
There are some problems with this.
 I'm using integer coordinates; yours may well not be.
If not, I would use a scaling factor on each of the three coordinates to bring them to some manageable range of integers, and store the real coordinates along with the name. I would do this as a single string rather than as an array ref, as it costs very little extra memory to store a longer scalar in an array element, but creating thousands of small arrays costs a lot.
 Depending upon how easily your coordinates map to integer ranges, and the size of those ranges, the array will likely be enormously sparse, and too big for memory.
In which case it makes sense to use a hash to implement a sparse array. The downsides of this are that the hash takes substantially more memory, and you then need to sort the hash keys prior to processing them.
Depending upon how many thousands of atoms we are talking about, this could well lead you to requiring more memory than is available, or even possible (assuming a 32-bit processor).
At this point the solution becomes to process the files and create a single output file where each record is of the form
xxyyzz name x.xx y.yy z.zz
A single pass to create the file. Then sort the file (using your system's sort utility), and then a single pass to read the sorted output, do the final comparisons, and write out the pairs in whatever format you need them.
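A rough sketch of that record format and the alpha sort (the atom names, coordinates, and the assumption that scaled coordinates fit in 0..99 are all invented for illustration):

```perl
use strict;
use warnings;

# Build "xxyyzz name x.xx y.yy z.zz" records as described above, then
# sort them so spatially close atoms end up adjacent, just as the system
# sort utility would do with the intermediate file.
my @atoms = (
    [ 'name1',  9.0,  9.0,  9.0 ],
    [ 'name2',  9.0, 11.0,  9.0 ],
    [ 'name3', 11.0, 11.0, 11.0 ],
);

my @records;
for my $atom (@atoms) {
    my ( $name, $x, $y, $z ) = @$atom;
    # zero-padded integer key first, then the name and real coordinates
    push @records, sprintf '%02d%02d%02d %s %.2f %.2f %.2f',
        int($x), int($y), int($z), $name, $x, $y, $z;
}

my @sorted = sort @records;    # plain string sort, thanks to the padding
print "$_\n" for @sorted;
```

The leading zeros are what make a plain string sort equivalent to a numeric sort on the key.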
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
 [reply] [d/l] [select] 

Hey, this approach works really well! I'll include it in my next version of Chemistry::Bond::Find. It scales almost linearly, maybe even better than the recursive partitioning algorithm as I implemented it, and it avoids some of the overhead of the recursive calls and the creation and passing of lots of data structures.
For my test file with 1717 atoms, my recursive algorithm took about 9.7 seconds, while the new method based on your post took 3.2 seconds.
I should remember that you can throw hashes at almost any problem with Perl...
 [reply] 

That's good. Unfortunately, anything involving recursion in Perl tends to hit performance quite badly once the volume of data rises.
If the mapping to integer coords is working okay, then you could (potentially) save a bit more time on large volumes of data by concatenating the packed integers rather than the ASCII representations.
$hash{ pack 'N3', $x, $y, $z } = 'name x.xx y.yy z.zz';
This would allow you to do an alpha sort rather than a numeric one, and avoids the need to add leading zeros to the values. It might also conserve a little space, though you are unlikely to see this unless your resultant concatenated ASCII values are longer than 12 characters.
You'd need to benchmark some typical datasets to see what, if any, benefit this would have, and where any transitions occurred.
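A small sketch of why the packed keys still sort correctly (the points and labels are invented): because 'N' is big-endian, a byte-wise string sort of the packed keys gives the same ordering as a numeric sort on the coordinates.

```perl
use strict;
use warnings;

# Pack each (x, y, z) triple into a fixed 12-byte big-endian key.
my %hash;
for my $pt ( [ 9, 9, 9 ], [ 11, 9, 9 ], [ 9, 11, 9 ] ) {
    my ( $x, $y, $z ) = @$pt;
    $hash{ pack 'N3', $x, $y, $z } = "atom at $x $y $z";
}

# A plain string sort of the binary keys orders the points numerically:
# (9,9,9) before (9,11,9) before (11,9,9).
my @in_order = map { $hash{$_} } sort keys %hash;
```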
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
 [reply] [d/l] 



$x[l][m][n][i] = X-value of i'th atom in the box
$y[l][m][n][i] = Y-value of i'th atom in the box
$z[l][m][n][i] = Z-value of i'th atom in the box
Then all I would need to do is loop over all boxes (using @box) and check for the existence of its neighbouring boxes (I would need a hash here, I guess... Hmm, box should be a hash and I can use its values in the above), and for all those that exist I can loop over their atoms.
Does this make sense or sound better?
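A minimal sketch of that box (cell-list) idea, with invented coordinates and cutoff: bin atoms into boxes whose edge equals the cutoff, then compare each atom only against atoms in its own box and the 26 neighbouring boxes.

```perl
use strict;
use warnings;

my $cutoff = 1.5;
my $cut_sq = $cutoff * $cutoff;    # compare squared distances, no sqrt
my @atoms  = (
    [ 'O1', 0.2, 0.3, 0.1 ],
    [ 'H1', 0.9, 0.4, 0.2 ],
    [ 'N1', 5.0, 5.0, 5.0 ],
);

# %box maps "l:m:n" to the list of atom indices falling in that box.
# Note: int() truncates toward zero; for negative coordinates POSIX::floor
# would be the safer choice.
my %box;
for my $i ( 0 .. $#atoms ) {
    my ( undef, $x, $y, $z ) = @{ $atoms[$i] };
    my $key = join ':', map { int( $_ / $cutoff ) } $x, $y, $z;
    push @{ $box{$key} }, $i;
}

my @pairs;
for my $key ( keys %box ) {
    my ( $l, $m, $n ) = split /:/, $key;
    for my $dl ( -1 .. 1 ) {
        for my $dm ( -1 .. 1 ) {
            for my $dn ( -1 .. 1 ) {
                my $nkey = join ':', $l + $dl, $m + $dm, $n + $dn;
                next unless exists $box{$nkey};
                for my $i ( @{ $box{$key} } ) {
                    for my $j ( @{ $box{$nkey} } ) {
                        next unless $i < $j;    # count each pair once
                        my $dx = $atoms[$i][1] - $atoms[$j][1];
                        my $dy = $atoms[$i][2] - $atoms[$j][2];
                        my $dz = $atoms[$i][3] - $atoms[$j][3];
                        push @pairs, [ $i, $j ]
                            if $dx*$dx + $dy*$dy + $dz*$dz <= $cut_sq;
                    }
                }
            }
        }
    }
}
```

Only O1 and H1 end up paired here; N1 sits in a distant box and is never even distance-checked against the others.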
 [reply] [d/l] 

Okay. From this post I think that my original post did not explain the method I was trying to outline very well. Also, looking back at itub's reply, he also seems not to have fully understood what I was getting at.
I've put together a crude simulation using some randomly generated data, and it seems that using my method will process:
10,000 atoms, in around 17 seconds using < 6 MB memory.
20,000 atoms, in around 112 seconds using < 11 MB
30,000 atoms, in around 350 seconds using < 14 MB.
From these crude tests, I think it would handle 100,000 atoms in around 2 1/2 hours using 40 MB ram.
Update: Ignore the figures above. I discovered a stupidity in my test case that meant I was doing way more work than I needed to. I can now process 100,000 atoms using 50 MB in under 4 minutes.
This is without any attempt to really optimise things. I have no real feel for how this compares with other methods as I don't know what other methods are used.
I don't know for sure that I am processing the data correctly.
Do you have a smallish (1,000 or 2,000 atom) dataset (with the desired results for a specified cutoff) that you could make available to me somehow?
Posting it here is probably not a good idea :), but I could let you have my email id.
Then I could see if what I am doing even approximates correct results before I go posting triumphantly.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
 [reply] 
Re: Processing data with lot of math...
by ColonelPanic (Friar) on May 12, 2004 at 14:43 UTC

 [reply] 

1. RAM problem.
RAM is 512 MB and swap 1 GB.
A file typically is 1.2 MB.
No. of lines (atoms): 20,000.
Note: This is in the test phase, but it can increase by about 10-50 fold.
2. Graph Theory problem.
This unfortunately is not a graph-theory problem, but more of a gas-phase problem. I need to find out all the 'interacting pairs' (i.e. pairs of atoms close enough) to do a more complicated analysis.
3. FORTRAN object file.
OK, maybe I can make this piece into a library file, so that I should be able to use it as a POSIX function?
4. Gridwise calculation.
Promising. :) I thought of it, but was a bit lazy to try. After brute force I began to wonder if it would be worth the effort... Thanks. :)
5. Chemistry::Bond::Find
Yes, I must try this. :) Mine is a protein with a hell of a lot of water, and I have to find water-protein hydrogen bonds, at least based on distance alone. Will get back to you, itub. :D
6. Using square of distance.
Well, yes, I was already using the square of the distance, and yes, it is on PDB files. :)
7. Using x:y:z box info.
Yes, I thought of it, but was lazy, as I thought I would need to put in a lot of code. But now I am convinced it won't be so much. Thanks a lot, BrowserUk. :)
8. The code:
# Open each PDB
foreach my $pdb_file (<$pdb_list>) {
    chomp($pdb_file);
    open( my $tmp_file, '<', $pdb_file ) or die "Cannot open PDB: $!";

    # Read X,Y,Z coordinates
    my @X;    # X-coordinates
    my @Y;    # Y-coordinates
    my @Z;    # Z-coordinates
    my @pdb_tmp = <$tmp_file>;
    foreach (@pdb_tmp) {
        if ( substr( $_, 0, 3 ) eq 'ATO' ) {
            push @X, substr( $_, 30, 8 );
            push @Y, substr( $_, 38, 8 );
            push @Z, substr( $_, 46, 8 );
        }
    }

    # Find the interaction pairs. Also store the best angle, should it
    # occur again with another proton. Also keep track of the waters
    # that have made an h-bond with the solute atoms.
    my @sel_wat;    # Array of water molecule (numbers) selected
    my %hbond;      # $hbond{"$tag1:$tag2"}[0] = distance
                    # $hbond{"$tag1:$tag2"}[1] = angle

    # Solute as donor and water as acceptor
    {
        my @atom_cov;    # Polar solute atoms that are already covered
        for ( my $i = 0 ; $i <= $#pol_h ; $i++ ) {
            # If this donor is not already covered, then go ahead.
            if ( !defined( $atom_cov[ $pol_h[$i][1] ] ) ) {
                for ( my $j = 0 ; $j <= $#wat_a ; $j++ ) {
                    my $dx = $X[ $pol_h[$i][1] ] - $X[ $wat_a[$j] ];
                    my $dy = $Y[ $pol_h[$i][1] ] - $Y[ $wat_a[$j] ];
                    my $dz = $Z[ $pol_h[$i][1] ] - $Z[ $wat_a[$j] ];
                    my $distSq = ( $dx * $dx ) + ( $dy * $dy ) + ( $dz * $dz );
                    if ( $distSq <= $hb_dist ) {   # $hb_dist holds the squared cutoff
                        $atom_cov[ $pol_h[$i][1] ] = 1;
                        print $idty[ $pol_h[$i][1] ], " ",
                            $idty[ $wat_a[$j] ], " ", $distSq, "\n";
                    }
                }
            }
        }
    }
}
 [reply] [d/l] 

Regarding the issue of finding Hbonds in PDB files:
Besides all the algorithmic suggestions already given, the problem may be even "smaller" than it looks. If your PDB file has the water molecules and the protein labeled properly, you now have to consider only water atom-protein atom pairs, instead of every possible pair. And not even every protein atom, if you restrict your definition of an H-bond to the typical O...H or N...H.
 [reply] 


2. Graph Theory problem.
This unfortunately is not a graph-theory problem, but more of a gas-phase problem. I need to find out all the 'interacting pairs' (i.e. pairs of atoms close enough) to do a more complicated analysis.
Your APPLICATION may be a gas-analysis problem, but the guts of it, finding points within a certain distance of a known point, is exactly a graph-theory problem.
 [reply] 

Re: Processing data with lot of math...
by AbigailII (Bishop) on May 12, 2004 at 14:49 UTC

Is there a way of significantly improving the speed in Perl that involves this kind of math?
Maybe, maybe not. You don't define how the "cutoff" distance is determined, so it's hard to say. Is it related to the lines where the atoms appear? Are X, Y, Z coordinates?
If I have a fortran code (subroutine) that reads three arrays and then give me output, can I use the object file generated by that program to plugin to Perl?
It's not clear what you mean here by "object file". Often, by "object file" we mean a file that can be linked into an executable, that is, the output of a compiler. But your description of how it's made suggests otherwise. If it contains data, then, yes, you can read it. You must know the structure of the data, though, to make sense out of it; you may need unpack if the data is in some binary format.
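A tiny sketch of that unpack point (the record layout of three native doubles is invented for illustration; a real file's layout must be known in advance):

```perl
use strict;
use warnings;

# Simulate one fixed-width binary record of three native doubles,
# then decode it with the matching unpack template.
my $record = pack 'd3', 1.5, -2.25, 10.0;
my ( $x, $y, $z ) = unpack 'd3', $record;
```

The pack and unpack templates must agree exactly, or the decoded values are garbage.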
Abigail
 [reply] 
Re: Processing data with lot of math...
by husker (Chaplain) on May 12, 2004 at 14:50 UTC

Before we can suggest an improvement to your algorithm, we need to see what you are doing now. I imagine you are just brute-force calculating the distance from each atom to each other atom in your array and comparing that to your cutoff distance? The math there is pretty simple... I don't think there's going to be a "trick" to calculating the distance between two points.
Any algorithmic trick is most likely going to involve doing FEWER calculations, not speeding up the calculations you already do.
Let's say you have your points A(1), A(2), ... A(n), each represented by their coordinates. I assume that at some point you calculate the distance from A(1) to A(2). Once you've done that, is your algorithm smart enough not to calculate the distance from A(2) to A(1) later? I would think so, but short of seeing your code, I make no assumptions.
 [reply] 
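The "don't revisit A(2), A(1)" point can be sketched like this (the points and cutoff are invented): starting the inner loop at $i + 1 halves the work, and comparing squared distances avoids sqrt entirely.

```perl
use strict;
use warnings;

my @pt     = ( [ 0, 0, 0 ], [ 1, 0, 0 ], [ 0, 3, 0 ] );
my $cut_sq = 2**2;    # squared cutoff distance

my $count = 0;
for my $i ( 0 .. $#pt - 1 ) {
    for my $j ( $i + 1 .. $#pt ) {    # never revisits the ($j, $i) pair
        my $dx = $pt[$i][0] - $pt[$j][0];
        my $dy = $pt[$i][1] - $pt[$j][1];
        my $dz = $pt[$i][2] - $pt[$j][2];
        $count++ if $dx*$dx + $dy*$dy + $dz*$dz <= $cut_sq;
    }
}
```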
Re: Processing data with lot of math...
by itub (Priest) on May 12, 2004 at 15:26 UTC

The problem is that the brute-force "nested loops" algorithm scales as N^2. If you have 1,000 atoms, you need almost 500,000 pairwise distance calculations! The solution is to divide and conquer, recursively partitioning the space into smaller boxes with a small number of atoms each. That way you can get almost N^1 scaling.
Take a look at the code in Chemistry::Bond::Find for an example. I just released it yesterday (a very early version...) because I had almost the same problem (maybe I was reading your mind?). With the naive N^2 algorithm it was taking almost three minutes to find the bonds in a protein with 1717 atoms; with the recursive partitioning it took only 10 seconds. Note, however, that if your cutoff distance is relatively large compared to the total system size, you won't get such a big improvement. Since I wanted to find the bonds, cutoffs were around 2 Å.
 [reply] 
Re: Processing data with lot of math...
by monkey_boy (Priest) on May 12, 2004 at 15:29 UTC

 [reply] 

This saves a great deal of time, as the sqrt() function is relatively expensive.
That used to be true in the old times, when sqrt was calculated in software and languages were simple. Well, simple languages still exist, but at least for Perl, the overhead of creating and destroying variables, moving things around, and executing the opcodes matters more than what it takes your hardware to do the floating-point operations. Just see this benchmark for an example:
use Benchmark ':all';
our $a = 347.3;
our $b = 876.1;
our $c;
cmpthese( 5_000_000,
{
'rand' => sub { rand() },
add => sub { $c = $a + $b },
mul => sub { $c = $a * $b },
div => sub { $c = $a / $b },
'sqrt' => sub { $c = sqrt($b) },
}
);
Benchmark: timing 5000000 iterations of add, div, mul, rand, sqrt...
add: 2 wallclock secs ( 1.32 usr + 0.00 sys = 1.32 CPU) @ 3787878.79/s (n=5000000)
div: 1 wallclock secs ( 1.58 usr + 0.00 sys = 1.58 CPU) @ 3164556.96/s (n=5000000)
mul: 2 wallclock secs ( 1.50 usr + 0.00 sys = 1.50 CPU) @ 3333333.33/s (n=5000000)
rand: 1 wallclock secs ( 1.15 usr + 0.00 sys = 1.15 CPU) @ 4347826.09/s (n=5000000)
sqrt: 0 wallclock secs ( 1.18 usr + 0.01 sys = 1.19 CPU) @ 4201680.67/s (n=5000000)
           Rate  div  mul  add sqrt rand
div   3164557/s   --  -5% -16% -25% -27%
mul   3333333/s   5%   -- -12% -21% -23%
add   3787879/s  20%  14%   -- -10% -13%
sqrt  4201681/s  33%  26%  11%   --  -3%
rand  4347826/s  37%  30%  15%   3%   --
Surprise! sqrt and rand may even be faster! The results are not entirely reproducible, but the short answer is: it doesn't really matter. As I said in a previous post, the problem here is the type of algorithm and how it scales.
 [reply] [d/l] 

 [reply] 

Re: Processing data with lot of math...
by ysth (Canon) on May 12, 2004 at 18:14 UTC

 [reply] [d/l] [select] 

Why use sqrt? There is no added value in it; instead, transform the cutoff value.
QM

Quantum Mechanics: The dreams stuff is made of
 [reply] [d/l] 

 [reply] 


p( 10, 10, 10 );
p( 14.142, 10, 0 );
p( 10, 14.142, 0 );
Using the origin as the base, all three points have a calculated distance of 17.32 and will sort adjacent to each other, but they are obviously in completely different positions, and the distance between each of the three points varies greatly.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
 [reply] [d/l] 

Look again.
The point is to avoid having both loops go over all the points. The distance from origin is just one way of easily filtering out some (with luck, most) points for the inner loop (since the distance between two points is guaranteed to be greater than or equal to the difference in their distances from a third point).
The distance between $one and $two is still checked.
 [reply] 
