Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Iteration speed

by meetraz (Hermit)
on Jun 15, 2004 at 17:24 UTC ( [id://366955]=note: print w/replies, xml ) Need Help??


in reply to Iteration speed

Can you post pseudo-code? It's hard to determine where the bottlenecks might be, without seeing the structure.

Replies are listed 'Best First'.
Re: Iteration speed
by seaver (Pilgrim) on Jun 16, 2004 at 14:32 UTC
    Dear all,

    Though this is a reply to meetraz, it involves having read the next 10 or so replies too.

    I'm now posting an example of the file data that I use, and the pseudo-algorithm

    But before I do so, I want to say that probably the best reply of the lot came from 'toma' I will truly go away and study computational geomtry now, thanks toma!

    File:

    ATOM 1 N GLY A 1 43.202 54.570 86.432 1.00 44.15 + N ATOM 2 CA GLY A 1 44.109 54.807 85.249 1.00 45.94 + C ATOM 3 C GLY A 1 44.984 56.034 85.443 1.00 43.20 + C ATOM 4 O GLY A 1 45.070 56.527 86.578 1.00 46.54 + O ATOM 5 N SER A 2 45.617 56.538 84.378 1.00 38.14 + N ATOM 6 CA SER A 2 46.461 57.710 84.544 1.00 33.00 + C ATOM 7 C SER A 2 46.057 59.017 83.842 1.00 29.38 + C ATOM 8 O SER A 2 46.522 60.071 84.270 1.00 32.21 + O ATOM 9 CB SER A 2 47.972 57.377 84.387 1.00 29.10 + C ATOM 10 OG SER A 2 48.326 56.931 83.084 1.00 25.44 + O ATOM 11 N HIS A 3 45.172 59.000 82.838 1.00 24.08 + N ATOM 12 CA HIS A 3 44.778 60.271 82.173 1.00 20.59 + C ATOM 13 C HIS A 3 43.299 60.425 81.787 1.00 18.87 + C ATOM 14 O HIS A 3 42.664 59.463 81.375 1.00 16.14 + O ATOM 15 CB HIS A 3 45.597 60.514 80.897 1.00 17.13 + C ATOM 16 CG HIS A 3 47.060 60.648 81.145 1.00 20.02 + C ATOM 17 ND1 HIS A 3 47.596 61.677 81.887 1.00 20.09 + N ATOM 18 CD2 HIS A 3 48.099 59.855 80.797 1.00 21.10 + C ATOM 19 CE1 HIS A 3 48.904 61.516 81.989 1.00 19.86 + C ATOM 20 NE2 HIS A 3 49.233 60.417 81.333 1.00 18.86 + N
    The co-ordinates are indeed cartesian, the last three columns are meaningless.

    The other important columns are the second through to the sixth columns, in order:

    1. atom number
    2. atom name
    3. residue name
    4. chain id
    5. residue number

    I should add that each line is for an atom and not for a residue, though I was originally talking about residue iteration, I use residue iteration to try and avoid any extra atomic iteration, but in reality, the number of lines is about 10-20 times the number of residues themselves, depending on the residue type

    The pseudo-code, which I'l post below, will ignore the format listed above, because each line is for one atom, and I have a coded PDB::Atom object (home-made) that takes the line as it's read, and parses it.

    For the sake of making the psuedo-code concise, it is indeed 'PSEUDO' so please dont expect it to run at all!

    The function 'addToMemory' simply fills the different hashes I use for lookup, especially to recall all the atoms in a residue. It is in there also, that the atomic data is added to the database, so there is one DB call per line in the file, this is one bottleneck, but much less of a bottleneck than the sheer number of iterations themselves

    I try to cut down on the iterations, by pre-calculating the 3D center of the residue, and then comparing the distance of a residue pair to a hard-coded cut-off (varies depending on residues themselves) in 'notClose' (not shown). This avoids having to iterate through all the atoms in a residue, if there is no chance of a bond.

    Finally the bond detection itself is in another function, not shown, and doesn't necessarily return a bond, there is more calculations depending on the nature of the atom itself. I just wanted to show the nature of the iterations themselves.

    It should be noted that there is a large amount of processing for particular residues and atoms due to possible errors in the file, which I've excluded.

    open (FILE, "< ".$file ) or die "\ncouldn't open FILE: ".$file.": $! +"; my $i=0; while(<FILE>){ if($_ =~ /^ATOM/){ $temp = new PDB::Atom('-line' => $_, '-number'=>$i); addToMemory($self, $temp, $i); ########bond detection ################# ###iterate through chains### foreach my $ch (sort {$a <=> $b} keys %{$self->{'contacts'}){ next if $temp->chain eq $ch; #skip same chains ###iterate through residues### foreach my $r (sort {$a <=> $b} keys %{$self->{'contacts'}{$ch +}}){ next if notClose($temp,self->{'contacts'}{$ch}{$r}); foreach my $a (sort {$a <=> $b} keys %{$self->{'residues'} +{$ch.$r}}){ $bonds->newBond($temp,$self->{'all'}{$ +a}); } } } } } close(FILE); sub addToMemory{ my ($self, $atom, $numb) = @_; $self->{'all'}{$numb}=$atom; unless($atom->proton){ $self->{'atoms'}{$numb} = 1; }else{ $self->{'protons'}{$numb} = 1; } $self->{'contacts'}{$atom->chain}{$atom->resNumber}=residualAverag +e($atom, $self->{'contacts'}{$atom->chain}{$atom->resNumber}); $self->{'residues'}{$ch.$atom->resNumber}{$numb}=1; }

    Edited by Chady -- converted pre to code tags.

      Sorry, but your pseudo code is just a little to pseudo. Without a clear understanding of the internals of the PDB::Atom objects and it's methods, I find it impossible to understand what is going on.

      One casual comment though. In general, perl's objects carry a considerable performance penalty relative to it's standard hashes and array's. Iterating large volumes of data represented as objects will be much slower than processing that same data stored in a hash or an array.

      This is no surprise, nor a critisism of Perl. Just s statement of fact. The extra levels of indirection and lookup are bound to carry a penalty, but when dealing with large volumes where speed is a criteria, it is best avoided.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      __UPDATE____

      dear all

      I've started a 'profile' on one of my biggest files (15 chains!) and here's the most telling results:

      %Time    Sec.     #calls   sec/call  F  name
      17.79 1043.2695  10727535   0.000097     PDB::Bonds::notClose
      14.88  872.6432  14509596   0.000060     PDB::Writer::numberFormat
      10.52  616.7291  14509596   0.000043     PDB::Bond::dist
       6.89  403.8085  14509597   0.000028     PDB::Writer::pad_left
       6.40  375.5808        1  375.580791  ?  HighPDB::parse
       6.39  374.5854  14509596   0.000026     PDB::Writer::pad_right
       5.54  325.1066  43586508   0.000007     UNIVERSAL::isa
       4.89  286.5572  1881489   0.000152     PDB::Bond::new
       3.49  204.5796  18291657   0.000011     PDB::Atom::x
       3.42  200.8238  18291657   0.000011     PDB::Atom::y
       3.23  189.6586  18291657   0.000010     PDB::Atom::z
       3.14  184.3266  1881489   0.000098     PDB::Bond::isHydphb
       3.14  184.2096  1880381   0.000098     PDB::Bond::isElcsta
       2.24  131.0808        1  131.080769     WhatIf::doWhatif
       1.91  111.7546  10730691   0.000010     PDB::Atom::resName
      
      The code for the first three subroutines are shown here:

      sub notClose{ my $self=shift; my ($a1,$a2,$m)=@_; my %hash = %$a2; my $cutoff = $lengths{$a1->resName} + 7 + $hash{'r'}; my $dist = PDB::Bond->dist($a1->x,$hash{'x'},$a1->y,$hash{'y'},$a1 +->z,$hash{'z'}); my $value = $dist<$cutoff ? 0 : 1; return $value; } sub dist{my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($x1, $x2, $y1, $y2, $z1, $z2)=@_; return numberFormat(sqrt ( ($x1 - $x2)**2 + ($y1 - $y2)**2 + ($z1 - $z2)**2 ), 1,2); } sub numberFormat{ my( $number, $whole, $frac ) = @_; return pad_left('0',$whole,'0').'.'.pad_right('0',$frac,'0')if $nu +mber == 0; return pad_left($number,$whole,'0') unless $number =~ /\./ || $fra +c; my ($left,$right); ($left,$right) = split /\./, $number; $left = pad_left($left, $whole, '0'); if(defined $right){ $right = pad_right( substr($right,0,$frac), $frac, '0' ); return "$left\.$right"; }else{ $right = pad_right( '0', $frac, '0'); return "$left\.$right"; } } sub pad_left { my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($item, $size, $padding) = @_; my $newItem = $item; $padding = ' ' unless defined $padding; while( length $newItem < $size ) { $newItem = "$padding$newItem"; } return $newItem; } sub pad_right { my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($item, $size, $padding) = @_; my $newItem = $item; $padding = ' ' unless defined $padding; while( length $newItem < $size ) { $newItem .= $padding; } return $newItem; }

      I had totally forgotten that I use numberFormat to manipulate the result of the sqrt function. (This is essential for the DB) I'm now going to move this to the DB part, so that it only gets called when adding 'real' bonds to the DB.

      I'm also going to remove the UNIVERSAL::isa calls, and just try to ASSUME $self whenever I can.

      Thanks for all the help, and I'm still investigating mr. Delauney.

      Cheers
      Sam

        FWIW. Here is the code I mentioned earlier. It just completed a run looking for pairs of atoms that are within .01 units of each other.

        The input was 1,000,000 atom coordinates randomly generated in file that looks like this

        atom000001 : 0119.110 0443.939 0146.027 atom000002 : -217.194 -175.476 -200.134 atom000003 : 0202.789 -383.911 0183.319 atom000004 : -459.869 0354.187 -421.509 atom000005 : 0446.625 0097.504 0243.835

        It found 308,822 pairs within the requisite distance of one another (from the 1,000,000,000,000 possibles) in just under 7 hours on 2Ghz machine.

        Whether the technique is adaptable to your application I'm not sure.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://366955]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2024-04-25 23:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found