Re: Iteration speed

Dear all,

Though this is a reply to meetraz, it involves having read the next 10 or so replies too.

I'm now posting an example of the file data that I use, and the pseudo-algorithm

But before I do so, I want to say that probably the best reply of the lot came from 'toma' I will truly go away and study computational geomtry now, thanks toma!

File:

ATOM      1  N   GLY A   1      43.202  54.570  86.432  1.00 44.15    
+       N
ATOM      2  CA  GLY A   1      44.109  54.807  85.249  1.00 45.94    
+       C
ATOM      3  C   GLY A   1      44.984  56.034  85.443  1.00 43.20    
+       C
ATOM      4  O   GLY A   1      45.070  56.527  86.578  1.00 46.54    
+       O
ATOM      5  N   SER A   2      45.617  56.538  84.378  1.00 38.14    
+       N
ATOM      6  CA  SER A   2      46.461  57.710  84.544  1.00 33.00    
+       C
ATOM      7  C   SER A   2      46.057  59.017  83.842  1.00 29.38    
+       C
ATOM      8  O   SER A   2      46.522  60.071  84.270  1.00 32.21    
+       O
ATOM      9  CB  SER A   2      47.972  57.377  84.387  1.00 29.10    
+       C
ATOM     10  OG  SER A   2      48.326  56.931  83.084  1.00 25.44    
+       O
ATOM     11  N   HIS A   3      45.172  59.000  82.838  1.00 24.08    
+       N
ATOM     12  CA  HIS A   3      44.778  60.271  82.173  1.00 20.59    
+       C
ATOM     13  C   HIS A   3      43.299  60.425  81.787  1.00 18.87    
+       C
ATOM     14  O   HIS A   3      42.664  59.463  81.375  1.00 16.14    
+       O
ATOM     15  CB  HIS A   3      45.597  60.514  80.897  1.00 17.13    
+       C
ATOM     16  CG  HIS A   3      47.060  60.648  81.145  1.00 20.02    
+       C
ATOM     17  ND1 HIS A   3      47.596  61.677  81.887  1.00 20.09    
+       N
ATOM     18  CD2 HIS A   3      48.099  59.855  80.797  1.00 21.10    
+       C
ATOM     19  CE1 HIS A   3      48.904  61.516  81.989  1.00 19.86    
+       C
ATOM     20  NE2 HIS A   3      49.233  60.417  81.333  1.00 18.86    
+       N
[download]

The co-ordinates are indeed cartesian, the last three columns are meaningless.

The other important columns are the second through to the sixth columns, in order:

atom number
atom name
residue name
chain id
residue number

I should add that each line is for an atom and not for a residue, though I was originally talking about residue iteration, I use residue iteration to try and avoid any extra atomic iteration, but in reality, the number of lines is about 10-20 times the number of residues themselves, depending on the residue type

The pseudo-code, which I'l post below, will ignore the format listed above, because each line is for one atom, and I have a coded PDB::Atom object (home-made) that takes the line as it's read, and parses it.

For the sake of making the psuedo-code concise, it is indeed 'PSEUDO' so please dont expect it to run at all!

The function 'addToMemory' simply fills the different hashes I use for lookup, especially to recall all the atoms in a residue. It is in there also, that the atomic data is added to the database, so there is one DB call per line in the file, this is one bottleneck, but much less of a bottleneck than the sheer number of iterations themselves

I try to cut down on the iterations, by pre-calculating the 3D center of the residue, and then comparing the distance of a residue pair to a hard-coded cut-off (varies depending on residues themselves) in 'notClose' (not shown). This avoids having to iterate through all the atoms in a residue, if there is no chance of a bond.

Finally the bond detection itself is in another function, not shown, and doesn't necessarily return a bond, there is more calculations depending on the nature of the atom itself. I just wanted to show the nature of the iterations themselves.

It should be noted that there is a large amount of processing for particular residues and atoms due to possible errors in the file, which I've excluded.

 open (FILE, "< ".$file ) or die "\ncouldn't open FILE: ".$file.": $! 
+";
    my $i=0;
    while(<FILE>){
    if($_ =~ /^ATOM/){
        $temp = new PDB::Atom('-line' => $_, '-number'=>$i);
        addToMemory($self, $temp, $i);

        ########bond detection ################# 
            ###iterate through chains###
        foreach my $ch (sort {$a <=> $b} keys %{$self->{'contacts'}){
            next if $temp->chain eq $ch; #skip same chains

            ###iterate through residues###
        foreach my $r (sort {$a <=> $b} keys %{$self->{'contacts'}{$ch
+}}){
            next if notClose($temp,self->{'contacts'}{$ch}{$r});
            foreach my $a (sort {$a <=> $b} keys %{$self->{'residues'}
+{$ch.$r}}){
                                $bonds->newBond($temp,$self->{'all'}{$
+a});
            }
            }
        }
        }
    }
    close(FILE);

sub addToMemory{
    my ($self, $atom, $numb) = @_;

    $self->{'all'}{$numb}=$atom;
    unless($atom->proton){
    $self->{'atoms'}{$numb} = 1;
    }else{
    $self->{'protons'}{$numb} = 1;
    }
    
    $self->{'contacts'}{$atom->chain}{$atom->resNumber}=residualAverag
+e($atom, $self->{'contacts'}{$atom->chain}{$atom->resNumber});
    $self->{'residues'}{$ch.$atom->resNumber}{$numb}=1;
}
[download]

Edited by Chady -- converted pre to code tags.

Comment on Re: Iteration speed Select or Download Code

Replies are listed 'Best First'.
Re^2: Iteration speed by BrowserUk (Patriarch) on Jun 16, 2004 at 18:32 UTC
Sorry, but your pseudo code is just a little to pseudo. Without a clear understanding of the internals of the PDB::Atom objects and it's methods, I find it impossible to understand what is going on. One casual comment though. In general, perl's objects carry a considerable performance penalty relative to it's standard hashes and array's. Iterating large volumes of data represented as objects will be much slower than processing that same data stored in a hash or an array. This is no surprise, nor a critisism of Perl. Just s statement of fact. The extra levels of indirection and lookup are bound to carry a penalty, but when dealing with large volumes where speed is a criteria, it is best avoided. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon	[reply]
Re^2: Iteration speed by seaver (Pilgrim) on Jun 16, 2004 at 19:35 UTC
__UPDATE____ dear all I've started a 'profile' on one of my biggest files (15 chains!) and here's the most telling results: %Time Sec. #calls sec/call F name 17.79 1043.2695 10727535 0.000097 PDB::Bonds::notClose 14.88 872.6432 14509596 0.000060 PDB::Writer::numberFormat 10.52 616.7291 14509596 0.000043 PDB::Bond::dist 6.89 403.8085 14509597 0.000028 PDB::Writer::pad_left 6.40 375.5808 1 375.580791 ? HighPDB::parse 6.39 374.5854 14509596 0.000026 PDB::Writer::pad_right 5.54 325.1066 43586508 0.000007 UNIVERSAL::isa 4.89 286.5572 1881489 0.000152 PDB::Bond::new 3.49 204.5796 18291657 0.000011 PDB::Atom::x 3.42 200.8238 18291657 0.000011 PDB::Atom::y 3.23 189.6586 18291657 0.000010 PDB::Atom::z 3.14 184.3266 1881489 0.000098 PDB::Bond::isHydphb 3.14 184.2096 1880381 0.000098 PDB::Bond::isElcsta 2.24 131.0808 1 131.080769 WhatIf::doWhatif 1.91 111.7546 10730691 0.000010 PDB::Atom::resName The code for the first three subroutines are shown here: sub notClose{ my $self=shift; my ($a1,$a2,$m)=@_; my %hash = %$a2; my $cutoff = $lengths{$a1->resName} + 7 + $hash{'r'}; my $dist = PDB::Bond->dist($a1->x,$hash{'x'},$a1->y,$hash{'y'},$a1 +->z,$hash{'z'}); my $value = $dist<$cutoff ? 0 : 1; return $value; } sub dist{my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($x1, $x2, $y1, $y2, $z1, $z2)=@_; return numberFormat(sqrt ( ($x1 - $x2)2 + ($y1 - $y2)2 + ($z1 - $z2)**2 ), 1,2); } sub numberFormat{ my( $number, $whole, $frac ) = @_; return pad_left('0',$whole,'0').'.'.pad_right('0',$frac,'0')if $nu +mber == 0; return pad_left($number,$whole,'0') unless $number =~ /\./ \|\| $fra +c; my ($left,$right); ($left,$right) = split /\./, $number; $left = pad_left($left, $whole, '0'); if(defined $right){ $right = pad_right( substr($right,0,$frac), $frac, '0' ); return "$left\.$right"; }else{ $right = pad_right( '0', $frac, '0'); return "$left\.$right"; } } sub pad_left { my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($item, $size, $padding) = @_; my $newItem = $item; $padding = ' ' unless defined $padding; while( length $newItem < $size ) { $newItem = "$padding$newItem"; } return $newItem; } sub pad_right { my $self=shift if UNIVERSAL::isa($_[0] => __PACKAGE__); my ($item, $size, $padding) = @_; my $newItem = $item; $padding = ' ' unless defined $padding; while( length $newItem < $size ) { $newItem .= $padding; } return $newItem; } [download] I had totally forgotten that I use numberFormat to manipulate the result of the sqrt function. (This is essential for the DB) I'm now going to move this to the DB part, so that it only gets called when adding 'real' bonds to the DB. I'm also going to remove the UNIVERSAL::isa calls, and just try to ASSUME $self whenever I can. Thanks for all the help, and I'm still investigating mr. Delauney. Cheers Sam	[reply] [d/l]
Re^3: Iteration speed by BrowserUk (Patriarch) on Jun 16, 2004 at 20:35 UTC
FWIW. Here is the code I mentioned earlier. It just completed a run looking for pairs of atoms that are within .01 units of each other. The input was 1,000,000 atom coordinates randomly generated in file that looks like this `atom000001 : 0119.110 0443.939 0146.027 atom000002 : -217.194 -175.476 -200.134 atom000003 : 0202.789 -383.911 0183.319 atom000004 : -459.869 0354.187 -421.509 atom000005 : 0446.625 0097.504 0243.835` [download] It found 308,822 pairs within the requisite distance of one another (from the 1,000,000,000,000 possibles) in just under 7 hours on 2Ghz machine. Whether the technique is adaptable to your application I'm not sure. Read more... (3 kB) Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon	[reply] [d/l] [select]