Processing data with lot of math...

by qhayaal (Beadle)
on May 12, 2004 at 14:31 UTC

qhayaal has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have thousands of files, each with thousands of entries. I need to process these files to do the following. Each line in any file has the Name, X, Y, Z of an atom. Read each file and find all the pairs of atoms that are within a cut-off distance. I could do it, but the problem is that it runs too slowly!

This was the sequence of steps I followed.
1. Read each file into an array directly.
2. Store X,Y,Z in three arrays.
3. With nested loops, find the pairs within cutoff distance.
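
A stripped-down sketch of those steps might look like the code below (it assumes whitespace-separated Name, X, Y, Z columns and a placeholder cutoff; the real files and cutoff may differ):

#!/usr/bin/perl
use strict;
use warnings;

my $cutoff    = 3.0;            # placeholder cut-off distance
my $cutoff_sq = $cutoff ** 2;

# 1-2. Read the file(s) and store Name, X, Y, Z in parallel arrays
my ( @name, @X, @Y, @Z );
while (<>) {
    my ( $n, $x, $y, $z ) = split;
    next unless defined $z;
    push @name, $n;
    push @X, $x;
    push @Y, $y;
    push @Z, $z;
}

# 3. Brute force: visit every pair once (j > i), compare squared distances
for my $i ( 0 .. $#X - 1 ) {
    for my $j ( $i + 1 .. $#X ) {
        my $dx = $X[$i] - $X[$j];
        my $dy = $Y[$i] - $Y[$j];
        my $dz = $Z[$i] - $Z[$j];
        print "$name[$i] $name[$j]\n"
            if $dx*$dx + $dy*$dy + $dz*$dz <= $cutoff_sq;
    }
}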

What I would like to know is:
A. Is there a way of significantly improving the speed in Perl that involves this kind of math? If not,
B. If I have Fortran code (a subroutine) that reads three arrays and then gives me output, can I use the object file generated by that program as a plugin to Perl?

Thanks in advance for any tips.

Have a nice day.

Replies are listed 'Best First'.
Re: Processing data with lot of math...
by swngnmonk (Pilgrim) on May 12, 2004 at 14:48 UTC

    Sounds to me like your first problem isn't that Perl is slow. Your first problem is that you have no idea whether or not your approach for processing all this data is efficient!

    You're describing a pretty classic Graph Theory problem - taking a large set of points in space (3-dimensional space in this case), and finding the subset of points that are within a certain distance of each other. Go find a good book on graph theory (Introduction to Algorithms is a great one), and make sure your approach is sound.

    Only then should you start worrying about whether or not Perl is fast enough.

Re: Processing data with lot of math...
by BrowserUk (Patriarch) on May 12, 2004 at 17:07 UTC

    I think that it is possible to do this with one loop, and almost no math!

    However, I could be missing something, so read, digest, cogitate and draw your own conclusions.

    For example, if you have an atom located at point ( 10, 10, 10 ), and the cut-off distance is 1 unit, then any atom that will be paired with this atom will have to be somewhere within a box defined by the points

    p( 09, 09, 09 ) p( 09, 09, 11 ) p( 09, 11, 09 ) p( 09, 11, 11 ) p( 11, 09, 09 ) p( 11, 09, 11 ) p( 11, 11, 09 ) p( 11, 11, 11 )

    So, instead of storing the x:y:z in separate arrays, or some other complex data structure, create a single array, indexed by the concatenations of x:y:z (with leading zeros).

    $atoms[ 090909 ] = 'name1';
    $atoms[ 090911 ] = 'name2';
    $atoms[ 091109 ] = 'name3';
    $atoms[ 091111 ] = 'name4';
    $atoms[ 110909 ] = 'name5';
    $atoms[ 110911 ] = 'name6';
    $atoms[ 111109 ] = 'name7';
    $atoms[ 111111 ] = 'name8';

    Once you have created this (sparse) array, it becomes very easy to pick out the limited set of possible pairings for any given point. It requires a final math check to exclude those points lying in the corners of the box, but the number of calculations should be significantly less than with the brute force method.

    There are some problems with this.

    1. I'm using integer coordinates, yours may well not be.

      If so, I would use a scaling factor upon each of the three coordinates to bring them to some manageable range of integers, and store the real coordinates along with the name. I would do this as a single string rather than as an array ref, as it costs very little extra memory to store a longer scalar in an array element, but creating thousands of small arrays costs lots.

    2. Depending upon how easily your coordinates map to integer ranges, and the size of those ranges, the array will likely be enormously sparse, and too big for memory.

      In which case it makes sense to use a hash to implement a sparse array. The downsides of this are that the hash takes substantially more memory, and you then need to sort the hash keys prior to processing them.

      Depending upon how many thousands of atoms we are talking about, this could well lead you to requiring more memory than is available, or even possible (assuming a 32-bit processor).

    At this point the solution becomes to process the files and create a single output file where each record is of the form

    xxyyzz name x.xx y.yy z.zz

    A single pass to create the file. Then sort the file (using your system's sort utility), and then a single pass to read the sorted output, do the final comparisons, and write out the pairs in whatever format you need them.
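
    As a rough sketch of how the in-memory variant might look (a hash standing in for the sparse array; the cutoff, box size and input format here are only placeholders):

    use strict;
    use warnings;

    my $cutoff    = 3.0;                 # placeholder cutoff
    my $cutoff_sq = $cutoff ** 2;

    # Bin every atom into a cubic box of side $cutoff, keyed "cx:cy:cz".
    my %box;                             # "cx:cy:cz" => [ [ name, x, y, z ], ... ]
    while (<>) {                         # assumes whitespace-separated Name X Y Z
        my ( $name, $x, $y, $z ) = split;
        next unless defined $z;
        my $key = join ':', map { int( $_ / $cutoff ) } $x, $y, $z;
        push @{ $box{$key} }, [ $name, $x, $y, $z ];
    }

    # An atom can only pair with atoms in its own box or one of the 26
    # neighbouring boxes, so only those get the exact distance check.
    # (Each pair is reported twice here, once from each end.)
    for my $key ( keys %box ) {
        my ( $cx, $cy, $cz ) = split /:/, $key;
        for my $p ( @{ $box{$key} } ) {
            for my $dx ( -1 .. 1 ) {
                for my $dy ( -1 .. 1 ) {
                    for my $dz ( -1 .. 1 ) {
                        my $nkey = join ':', $cx + $dx, $cy + $dy, $cz + $dz;
                        next unless $box{$nkey};
                        for my $q ( @{ $box{$nkey} } ) {
                            next if $p == $q;            # skip the atom itself
                            my $d2 = ( $p->[1] - $q->[1] ) ** 2
                                   + ( $p->[2] - $q->[2] ) ** 2
                                   + ( $p->[3] - $q->[3] ) ** 2;
                            print "$p->[0] $q->[0]\n" if $d2 <= $cutoff_sq;
                        }
                    }
                }
            }
        }
    }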


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      Hey, this approach works really well! I'll include it in my next version of Chemistry::Bond::Find. It scales almost linearly, maybe even better than the recursive partitioning algorithm as I implemented it, and it avoids some of the overhead of the recursive calls and the creation and passing of lots of data structures.

      For my test file with 1717 atoms, my recursive algorithm took about 9.7 seconds, while the new method based on your post took 3.2 seconds.

      I should remember that you can throw hashes at almost any problem with Perl...

        That's good. Unfortunately, anything involving recursion in perl tends to hit performance quite badly once the volumes of data rise.

        If the mapping to integer coords is working okay, then you could (potentially) save a bit more time on large volumes of data by concatenating the pack'd integers rather than the ascii representations.

        $hash{ join'', pack 'N3', $x, $y, $z } = 'name x.xx y.yy z.zz';

        This would allow you to do an alpha sort rather than a numeric one, and avoids the need to add leading zeros to the values. It might also conserve a little space, though you are unlikely to see this unless your resultant concatenated ascii values are greater than 12 characters.

        You'd need to benchmark some typical datasets to see what benefit, if any, this would have, and where any transitions occurred.
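
        For instance, a small sketch (assuming the scaled coordinates are non-negative integers that fit in 32 bits):

        use strict;
        use warnings;

        my %atom;
        $atom{ pack 'N3', 9, 9, 9    } = 'name1 9.02 9.01 8.99';
        $atom{ pack 'N3', 11, 11, 11 } = 'name8 10.98 11.03 11.01';

        # A plain string sort of the packed keys gives the same order as a
        # numeric sort on (x, y, z), because 'N' is fixed-width big-endian.
        for my $key ( sort keys %atom ) {
            my ( $x, $y, $z ) = unpack 'N3', $key;
            print "$x $y $z => $atom{$key}\n";
        }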


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
      Yes, but can this be simplified?

      For example, I can store all the boxes that exist in an array as: $box[$i]="L:M:N". Now I can store coordinates as:

      $x[l][m][n][i] = X value of the i'th atom in the box
      $y[l][m][n][i] = Y value of the i'th atom in the box
      $z[l][m][n][i] = Z value of the i'th atom in the box
      Then all I would need to do is loop over all the boxes (using @box), check for the existence of each one's neighbouring boxes (I would need a hash here, I guess... Emmm..., box should be a hash and I can use its values in the above), and for all those that exist, loop over their atoms.

      Does this make sense or sound better?

        Okay. From this post I think that my original post did not explain the method I was trying to outline very well. Also, looking back at itub's reply, he also seems not to have fully understood what I was getting at.

        I've put together a crude simulation using some randomly generated data, and it seems that using my method it will process:

        10,000 atoms in around 17 seconds using < 6 MB of memory;
        20,000 atoms in around 112 seconds using < 11 MB;
        30,000 atoms in around 350 seconds using < 14 MB.

        From these crude tests, I think it would handle 100,000 atoms in around 2 1/2 hours using 40 MB of RAM.

        Update Ignore the figures above. I discovered a stupidity in my testcase that meant I was doing way more work than I needed to. I can now process 100,000 atoms using 50 MB in under 4 minutes.

        This is without any attempt to really optimise things. I have no real feel for how this compares with other methods as I don't know what other methods are used.

        I don't know for sure that I am processing the data correctly.

        Do you have a smallish (1000 or 2000 atom) dataset (with the desired results for a specified cutoff) that you could make available to me somehow?

        Posting it here is probably not a good idea :), but I could let you have my email id.

        Then I could see if what I am doing even approximates to giving correct results before I go posting triumphantly.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: Processing data with lot of math...
by ColonelPanic (Friar) on May 12, 2004 at 14:43 UTC
    Your speed problem is probably caused by using up all of your RAM by putting the entire files in memory. If you can figure out an algorithm that doesn't do this, you'll be much better off. Also, if you post the code you have now, it will make it easier to offer additional help.


    When's the last time you used duct tape on a duct? --Larry Wall
      1. RAM problem.
      RAM is 512MB and swap 1GB.
      A file is typically 1.2 MB.
      No. of lines (atoms): 20,000.
      Note: This is in test phase, but it can increase by about 10-50 fold.

      2. Graph Theory problem.
      This unfortunately is not a Graph-Theory problem, but more of a gas-phase problem. I need to find out all the 'interacting-pairs' (ie pairs of atoms close enough) to do a more complicated analysis.

      3. FORTRAN object file.
      Ok, maybe I can make this piece into a library file, so that I should be able to use it as a POSIX function?

      4. Grid-wise calculation.
      Promising. :) I thought of it, but was a bit too lazy to try. After brute force I began to wonder if it would be worth the effort... Thanks. :)

      5. Chemistry::Bond::Find
      Yes, I must try this. :) Mine is a protein with a hell of a lot of water, and I have to find water-protein hydrogen bonds, at least based on distance alone. Will get back to you, itub. :D

      6. Using square of distance.
      Well, yes, I was already using the square of the distance, and yes, it is on PDB files. :)

      7. Using x:y:z boxinfo.
      Yes. I thought of it, but was lazy as I thought I'd need to put in a lot of code. But now I am convinced it won't be so much. Thanks a lot BrowserUk. :)

      8. The code:
      # Open each PDB
      foreach my $pdb_file (<$pdb_list>) {
          chomp($pdb_file);

          my $tmp_file;   # We would open the PDB with this handle
          open($tmp_file, "< $pdb_file") or die "Cannot open PDB: $!";

          # Read X,Y,Z coordinates
          my @X;          # X-coordinates
          my @Y;          # Y-coordinates
          my @Z;          # Z-coordinates
          my @pdb_tmp = <$tmp_file>;
          foreach (@pdb_tmp) {
              if (substr($_, 0, 3) eq "ATO") {
                  push @X, substr($_, 30, 8);
                  push @Y, substr($_, 38, 8);
                  push @Z, substr($_, 46, 8);
              }
          }

          # Find the interaction pairs. Also store the best angle, should it
          # occur again with another proton. Also keep track of the waters
          # that have made an h-bond with the solute atoms.
          my @sel_wat;    # Array of water molecule (numbers) selected
          my %hbond;      # $hbond{"$tag1:$tag2"}[0] = distance
                          # $hbond{"$tag1:$tag2"}[1] = angle

          # Solute as donor and water as acceptor
          {
              my @atom_cov;   # Polar solute atoms that are already covered
              for (my $i = 0; $i <= $#pol_h; $i++) {
                  # If this donor is not already covered, then go ahead.
                  if ( ! defined($atom_cov[$pol_h[$i][1]]) ) {
                      for (my $j = 0; $j <= $#wat_a; $j++) {
                          my $dx = $X[$pol_h[$i][1]] - $X[$wat_a[$j]];
                          my $dy = $Y[$pol_h[$i][1]] - $Y[$wat_a[$j]];
                          my $dz = $Z[$pol_h[$i][1]] - $Z[$wat_a[$j]];
                          my $distSq = ($dx*$dx) + ($dy*$dy) + ($dz*$dz);
                          if ($distSq <= $hb_dist) {
                              $atom_cov[$pol_h[$i][1]] = 1;
                              print $idty[$pol_h[$i][1]], " ",
                                    $idty[$wat_a[$j]], " ", $distSq, "\n";
                          }
                      }
                  }
              }
          }
      }

        Regarding the issue of finding H-bonds in PDB files:

        Besides all the algorithmic suggestions already given, the problem may be even "smaller" than it looks. If your PDB file has the water molecules and the protein labeled properly, you only have to consider water atom-protein atom pairs, instead of every possible pair. And not even every protein atom, if you restrict your definition of an H-bond to the typical O...H or N...H.
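
        A rough pre-filter along those lines might look like this sketch (the fixed PDB columns are standard, but the 'HOH' residue name and the crude N/O "polar atom" test are assumptions):

        use strict;
        use warnings;

        my ( @water_O, @protein_polar );
        while (<>) {
            next unless /^(?:ATOM  |HETATM)/;
            my $atom_name = substr $_, 12, 4;    # columns 13-16
            my $res_name  = substr $_, 17, 3;    # columns 18-20
            my @xyz = ( substr($_, 30, 8), substr($_, 38, 8), substr($_, 46, 8) );

            if ( $res_name eq 'HOH' ) {          # water residue name (may vary)
                push @water_O, [ $atom_name, @xyz ] if $atom_name =~ /O/;
            }
            elsif ( $atom_name =~ /[NO]/ ) {     # crude "polar protein atom" test
                push @protein_polar, [ $atom_name, @xyz ];
            }
        }

        # Only @water_O x @protein_polar pairs need a distance check now,
        # instead of every atom against every other atom.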

        2. Graph Theory problem. This unfortunately is not a Graph-Theory problem, but more of a gas-phase problem. I need to find out all the 'interacting-pairs' (ie pairs of atoms close enough) to do a more complicated analysis.

        Your APPLICATION may be a gas-analysis problem, but the guts of it, finding points within a certain distance of a known point, is exactly a graph-theory problem.

Re: Processing data with lot of math...
by Abigail-II (Bishop) on May 12, 2004 at 14:49 UTC
    Is there a way of significantly improving the speed in Perl that involves this kind of math?
    Maybe, maybe not. You don't define how the "cutoff" distance is determined, so it's hard to say. Is it related to the lines where the atoms appear? Are X, Y, Z coordinates?
    If I have a fortran code (subroutine) that reads three arrays and then give me output, can I use the object file generated by that program to plugin to Perl?
    It's not clear what you mean here by "object file". Often, by "object file" we mean a file that can be linked into an executable, that is, the output of a compiler. But your description of how it's made suggests otherwise. If it contains data, then, yes, you can read it. You must know the structure of the data, though, to make sense out of it - you may need unpack if the data is in some binary format.

    Abigail

Re: Processing data with lot of math...
by husker (Chaplain) on May 12, 2004 at 14:50 UTC
    Before we can suggest an improvement to your algorithm, we need to see what you are doing now. I imagine you are just brute-force calculating the distance from each atom to every other atom in your array and comparing that to your cut-off distance? The math there is pretty simple ... I don't think there's going to be a "trick" to calculating the distance between two points.

    Any algorithmic trick is most likely going to involve doing FEWER calculations, not speeding up the calculations you already do.

    Let's say you have your points A(1), A(2), ... A(n), each represented by their coordinates. I assume that at some point you calculate the distance from A(1) to A(2). Once you've done that, is your algorithm smart enough not to calculate the distance from A(2) to A(1) later? I would think so, but short of seeing your code, I make no assumptions.

Re: Processing data with lot of math...
by itub (Priest) on May 12, 2004 at 15:26 UTC

    The problem is that the brute-force "nested loops" algorithm scales as N^2. If you have 1000 atoms, you need almost 500,000 pairwise distance calculations (N(N-1)/2 pairs)! The solution is to divide and conquer by recursively partitioning the space into smaller boxes with a small number of atoms. That way you can get almost N^1 scaling.

    Take a look at the code in Chemistry::Bond::Find for an example. I just released it yesterday (a very early version...) because I had almost the same problem (maybe I was reading your mind?). With the naive N^2 algorithm it was taking almost three minutes to find the bonds in a protein with 1717 atoms; with the recursive partitioning it took only 10 seconds. Note, however, that if your cutoff distance is relatively large compared to the total system size, you won't get such a big improvement. Since I wanted to find the bonds, cutoffs were around 2 Å.
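
    If it fits your data, using the module might look roughly like this (a sketch based on the synopsis of that early release; check the module's docs, as the interface may change):

    use strict;
    use warnings;
    use Chemistry::MacroMol;
    use Chemistry::File::PDB;          # registers the PDB reader
    use Chemistry::Bond::Find ':all';

    my $mol = Chemistry::MacroMol->read('protein.pdb');
    find_bonds($mol);                  # recursive partitioning under the hood

    my @atoms = $mol->atoms;
    my @bonds = $mol->bonds;
    printf "%d atoms, %d bonds found\n", scalar @atoms, scalar @bonds;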

Re: Processing data with lot of math...
by monkey_boy (Priest) on May 12, 2004 at 15:29 UTC
    I guess these are PDB files(?).
    I have done similar manipulations to PDB files with few (speed) problems. There will always be some performance issues when calculating an "all by all" distance matrix, but I don't think the problem here is with RAM, as most PDB files are not that big. Perhaps if you post the code I may be able to help.

    One optimisation applies if the distance cutoff is fixed (e.g. 3 angstroms).
    Then instead of calculating the Euclidean distance, calculate the squared Euclidean distance and see if it is > 9 (i.e. 3**2).
    This saves a great deal of time, as the sqrt() function is relatively expensive.
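
    For instance (a trivial illustration; the 3 angstrom cutoff is just an example):

    use strict;
    use warnings;

    my $cutoff    = 3;              # e.g. 3 angstroms
    my $cutoff_sq = $cutoff ** 2;   # squared once, outside the loops

    # Inside the pair loop, compare the squared distance to the squared
    # cutoff; this gives the same answer with no sqrt() per pair.
    sub within_cutoff {
        my ( $dx, $dy, $dz ) = @_;
        return $dx*$dx + $dy*$dy + $dz*$dz <= $cutoff_sq;
    }

    print "pair found\n" if within_cutoff( 1, 2, 2 );   # true distance is 3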
    hope this helps.
    I should really do something about this apathy ... but i just cant be bothered

      This saves a great deal of time, as the sqrt() function is relativly expensive.

      That used to be true in the old days, when sqrt was calculated in software and languages were simple. Well, simple languages still exist, but at least for Perl, the overhead of creating and destroying variables, moving things around, and executing the opcodes matters more than what it takes your hardware to do the floating-point operations. Just see this benchmark for an example:

      use Benchmark ':all';

      our $a = 347.3;
      our $b = 876.1;
      our $c;

      cmpthese( 5_000_000, {
          'rand' => sub { rand() },
          add    => sub { $c = $a + $b },
          mul    => sub { $c = $a * $b },
          div    => sub { $c = $a / $b },
          'sqrt' => sub { $c = sqrt($b) },
      } );

      Benchmark: timing 5000000 iterations of add, div, mul, rand, sqrt...
             add:  2 wallclock secs ( 1.32 usr +  0.00 sys =  1.32 CPU) @ 3787878.79/s (n=5000000)
             div:  1 wallclock secs ( 1.58 usr +  0.00 sys =  1.58 CPU) @ 3164556.96/s (n=5000000)
             mul:  2 wallclock secs ( 1.50 usr +  0.00 sys =  1.50 CPU) @ 3333333.33/s (n=5000000)
            rand:  1 wallclock secs ( 1.15 usr +  0.00 sys =  1.15 CPU) @ 4347826.09/s (n=5000000)
            sqrt:  0 wallclock secs ( 1.18 usr +  0.01 sys =  1.19 CPU) @ 4201680.67/s (n=5000000)
                Rate  div  mul  add sqrt rand
      div  3164557/s   --  -5% -16% -25% -27%
      mul  3333333/s   5%   -- -12% -21% -23%
      add  3787879/s  20%  14%   -- -10% -13%
      sqrt 4201681/s  33%  26%  11%   --  -3%
      rand 4347826/s  37%  30%  15%   3%   --
      

      Surprise! sqrt and rand may even be faster! The results are not entirely reproducible, but the short answer is: it doesn't really matter. As I said in a previous post, the problem here is the type of algorithm and how it scales.

        Erm... I don't understand; it's not really the point how fast sqrt() is. The main thing is you're doing an extra calculation that is completely redundant.
        If you have 10,000 atoms then 10,000*10,000 sqrt()'s has an effect on performance.
        It must be better to do just one $cutoff_dist**2?
        However, I do agree that any improvements to the brute force method would be better.
        Cheers, monkey_boy
        I should really do something about this apathy ... but i just cant be bothered
Re: Processing data with lot of math...
by ysth (Canon) on May 12, 2004 at 18:14 UTC
    Some (untested) pseudo-code that may help (assuming from your description that you have separate arrays for each of X, Y, and Z):
    • calculate distance from origin (or some other point) for each point in your set
      @dist = map sqrt( $x[$_]**2 + $y[$_]**2 + $z[$_]**2 ), 0..$#x;

      or, for some $x0, $y0, $z0,

      @dist = map sqrt( ($x[$_]-$x0)**2 + ($y[$_]-$y0)**2 + ($z[$_]-$z0)**2 ), 0..$#x;
    • sort points by that distance
      @indices = sort { $dist[$a] <=> $dist[$b] } 0..$#dist;
      @dist = @dist[@indices];
      @x    = @x[@indices];
      @y    = @y[@indices];
      @z    = @z[@indices];
      @name = @name[@indices];
    • loop through your points, for each point checking distance of the following points only until the distance from origin has increased by your cutoff
      for my $one (0 .. $#dist - 1) {
          my $two = $one;
          while (++$two < @dist && $dist[$two] - $dist[$one] <= $cutoff) {
              print "found pair: ($x[$one],$y[$one],$z[$one]) and ($x[$two],$y[$two],$z[$two])\n"
                  if sqrt( ($x[$one]-$x[$two])**2
                         + ($y[$one]-$y[$two])**2
                         + ($z[$one]-$z[$two])**2 ) <= $cutoff;
          }
      }
    This limits how much work the interior loop has to do for each iteration of the exterior loop (unless you have points mostly lying within cutoff distance from the surface of a sphere around your chosen origin, so some knowledge of your data will help choose a good origin).

    Update: added code fragments
    Update: sort the names with the points :)

      Why use sqrt? There is no added value in it; instead, transform the cutoff value.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Because there's no point in optimizing until you have triaged your algorithm (irremediably slow, a little too slow, or fast enough). If I were certain what form the original questioner's points are stored in, I would have even used a distance() sub. Code should be written for minimum conceptual complexity and then optimized only if needed, and having a separate $cutoff_squared checked against a squared distance is beyond my minimum. YMMV.

        Update: the above is obviously hypocritical; I get sucked into premature optimization and complexity fascination all the time.

      This method won't work. Take the three points

      p( 10, 10, 10 );
      p( 14.142, 10, 0 );
      p( 10, 14.142, 0 );

      Using the origin as the base, all three points have a calculated distance of 17.32 and will sort adjacent to each other, but they are obviously in completely different positions and the distance between each of the three points varies greatly.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
        Look again.

        The point is to avoid having both loops go over all the points. The distance from origin is just one way of easily filtering out some (with luck, most) points for the inner loop (since the distance between two points is guaranteed to be greater than or equal to the difference in their distances from a third point).

        The distance between $one and $two is still checked.
