Re: Iteration speed

Re: Iteration speed by seaver (Pilgrim) on Jun 16, 2004 at 14:32 UTC
Dear all,

Though this is a reply to meetraz, it draws on the next ten or so replies too. I'm now posting an example of the file data that I use, and the pseudo-algorithm. But before I do so, I want to say that probably the best reply of the lot came from 'toma'. I will truly go away and study computational geometry now, thanks toma!

File:

    ATOM      1  N   GLY A   1      43.202  54.570  86.432  1.00 44.15           N
    ATOM      2  CA  GLY A   1      44.109  54.807  85.249  1.00 45.94           C
    ATOM      3  C   GLY A   1      44.984  56.034  85.443  1.00 43.20           C
    ATOM      4  O   GLY A   1      45.070  56.527  86.578  1.00 46.54           O
    ATOM      5  N   SER A   2      45.617  56.538  84.378  1.00 38.14           N
    ATOM      6  CA  SER A   2      46.461  57.710  84.544  1.00 33.00           C
    ATOM      7  C   SER A   2      46.057  59.017  83.842  1.00 29.38           C
    ATOM      8  O   SER A   2      46.522  60.071  84.270  1.00 32.21           O
    ATOM      9  CB  SER A   2      47.972  57.377  84.387  1.00 29.10           C
    ATOM     10  OG  SER A   2      48.326  56.931  83.084  1.00 25.44           O
    ATOM     11  N   HIS A   3      45.172  59.000  82.838  1.00 24.08           N
    ATOM     12  CA  HIS A   3      44.778  60.271  82.173  1.00 20.59           C
    ATOM     13  C   HIS A   3      43.299  60.425  81.787  1.00 18.87           C
    ATOM     14  O   HIS A   3      42.664  59.463  81.375  1.00 16.14           O
    ATOM     15  CB  HIS A   3      45.597  60.514  80.897  1.00 17.13           C
    ATOM     16  CG  HIS A   3      47.060  60.648  81.145  1.00 20.02           C
    ATOM     17  ND1 HIS A   3      47.596  61.677  81.887  1.00 20.09           N
    ATOM     18  CD2 HIS A   3      48.099  59.855  80.797  1.00 21.10           C
    ATOM     19  CE1 HIS A   3      48.904  61.516  81.989  1.00 19.86           C
    ATOM     20  NE2 HIS A   3      49.233  60.417  81.333  1.00 18.86           N

The co-ordinates are indeed Cartesian; the last three columns are meaningless.
The other important columns are the second through the sixth, in order: atom number, atom name, residue name, chain ID, residue number.

I should add that each line is for an atom and not for a residue. Though I was originally talking about residue iteration, I use residue iteration to try to avoid any extra atomic iteration; in reality, the number of lines is about 10-20 times the number of residues, depending on the residue type.

The pseudo-code, which I'll post below, ignores the format listed above, because each line is for one atom, and I have a home-made PDB::Atom object that takes the line as it's read and parses it. For the sake of keeping the pseudo-code concise, it is indeed 'PSEUDO', so please don't expect it to run at all!

The function 'addToMemory' simply fills the different hashes I use for lookup, especially to recall all the atoms in a residue. It is also in there that the atomic data is added to the database, so there is one DB call per line in the file. This is one bottleneck, but much less of a bottleneck than the sheer number of iterations themselves.

I try to cut down on the iterations by pre-calculating the 3D center of the residue, and then comparing the distance of a residue pair to a hard-coded cut-off (which varies depending on the residues themselves) in 'notClose' (not shown). This avoids having to iterate through all the atoms in a residue if there is no chance of a bond.

Finally, the bond detection itself is in another function, not shown, and doesn't necessarily return a bond; there are more calculations depending on the nature of the atom itself. I just wanted to show the nature of the iterations themselves. It should be noted that there is a large amount of processing for particular residues and atoms due to possible errors in the file, which I've excluded.

    # Still pseudo-code: PDB::Atom, residualAverage, notClose and $bonds
    # are defined elsewhere and are not shown.
    open(my $fh, '<', $file) or die "\ncouldn't open FILE: $file: $!\n";
    my $i = 0;
    while (my $line = <$fh>) {
        next unless $line =~ /^ATOM/;
        my $temp = PDB::Atom->new('-line' => $line, '-number' => $i);
        addToMemory($self, $temp, $i);

        ######## bond detection ########
        ### iterate through chains (chain IDs are letters, so sort as strings) ###
        foreach my $ch (sort keys %{ $self->{'contacts'} }) {
            next if $temp->chain eq $ch;    # skip same chain
            ### iterate through residues ###
            foreach my $r (sort { $a <=> $b } keys %{ $self->{'contacts'}{$ch} }) {
                next if notClose($temp, $self->{'contacts'}{$ch}{$r});
                ### iterate through the atoms of a nearby residue ###
                foreach my $n (sort { $a <=> $b } keys %{ $self->{'residues'}{$ch.$r} }) {
                    $bonds->newBond($temp, $self->{'all'}{$n});
                }
            }
        }
        $i++;    # index of the next atom
    }
    close($fh);

    sub addToMemory {
        my ($self, $atom, $numb) = @_;
        $self->{'all'}{$numb} = $atom;
        if ($atom->proton) {
            $self->{'protons'}{$numb} = 1;
        }
        else {
            $self->{'atoms'}{$numb} = 1;
        }
        # fold this atom into the stored per-residue entry used by notClose
        $self->{'contacts'}{$atom->chain}{$atom->resNumber} =
            residualAverage($atom, $self->{'contacts'}{$atom->chain}{$atom->resNumber});
        $self->{'residues'}{$atom->chain . $atom->resNumber}{$numb} = 1;
    }

Edited by Chady -- converted pre to code tags.
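The residue-center pre-filter described above can be sketched with plain hashes and no objects. This is a minimal illustration, not the poster's code: `parse_atom`, `add_to_center`, and `not_close` are hypothetical names, the `split`-based parsing assumes whitespace-separated fields as in the sample (real PDB records are fixed-width), and the cut-off is whatever value the caller passes in.

```perl
use strict;
use warnings;

# Parse one ATOM line into a plain hash. Assumes whitespace-separated
# fields as in the sample; real PDB records are fixed-width.
sub parse_atom {
    my ($line) = @_;
    my (undef, $num, $name, $res, $chain, $resnum, $x, $y, $z) = split ' ', $line;
    return {
        num   => $num,   name   => $name,   res => $res,
        chain => $chain, resnum => $resnum,
        x     => $x,     y      => $y,      z   => $z,
    };
}

# Maintain a running center (mean position) for one residue.
# Pass undef as the first argument to start a new residue.
sub add_to_center {
    my ($center, $atom) = @_;
    $center ||= { x => 0, y => 0, z => 0, n => 0 };
    my $n = ++$center->{n};
    $center->{$_} += ($atom->{$_} - $center->{$_}) / $n for qw(x y z);
    return $center;
}

# True when the atom is at least $cutoff away from the residue center.
# Squared distances are compared, so no sqrt is needed.
sub not_close {
    my ($atom, $center, $cutoff) = @_;
    my $d2 = 0;
    $d2 += ($atom->{$_} - $center->{$_}) ** 2 for qw(x y z);
    return $d2 >= $cutoff ** 2;
}
```

In a reading loop one would keep the centers in a hash keyed by chain and residue number, update the entry with `add_to_center` for every atom, and call `not_close` before descending into a residue's atoms.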
Re^2: Iteration speed by BrowserUk (Patriarch) on Jun 16, 2004 at 18:32 UTC
Sorry, but your pseudo-code is just a little too pseudo. Without a clear understanding of the internals of the PDB::Atom objects and their methods, I find it impossible to understand what is going on.

One casual comment though. In general, Perl's objects carry a considerable performance penalty relative to its standard hashes and arrays. Iterating large volumes of data represented as objects will be much slower than processing that same data stored in a hash or an array.

This is no surprise, nor a criticism of Perl. Just a statement of fact. The extra levels of indirection and lookup are bound to carry a penalty, and when dealing with large volumes where speed is a criterion, that penalty is best avoided.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
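To get a feel for the overhead described here, the two access styles can be timed with the core Benchmark module. This is only a sketch: the `Point` class and the `sum_objects` / `sum_hashes` helpers are made up for illustration and are not the PDB::Atom class from the thread, and the measured ratio will differ between machines and Perl versions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

{
    package Point;    # hypothetical minimal class with one accessor
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
    sub x   { return $_[0]{x} }
}

# The same data twice: once as objects, once as plain hash refs.
my @objects = map { Point->new(x => $_) } 1 .. 1000;
my @hashes  = map { +{ x => $_ } } 1 .. 1000;

# Sum the x values through the accessor method.
sub sum_objects {
    my $total = 0;
    $total += $_->x for @objects;
    return $total;
}

# Sum the x values through direct hash lookups.
sub sum_hashes {
    my $total = 0;
    $total += $_->{x} for @hashes;
    return $total;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    method_call => \&sum_objects,
    hash_lookup => \&sum_hashes,
});
```

Both subs compute the same total, so any difference in the table comes from the method dispatch and the extra sub call, not from the arithmetic.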
Re^2: Iteration speed by seaver (Pilgrim) on Jun 16, 2004 at 19:35 UTC
__UPDATE__

Dear all,

I've started a 'profile' on one of my biggest files (15 chains!) and here are the most telling results:

    %Time       Sec.    #calls    sec/call  F  name
    17.79  1043.2695  10727535    0.000097     PDB::Bonds::notClose
    14.88   872.6432  14509596    0.000060     PDB::Writer::numberFormat
    10.52   616.7291  14509596    0.000043     PDB::Bond::dist
     6.89   403.8085  14509597    0.000028     PDB::Writer::pad_left
     6.40   375.5808         1  375.580791  ?  HighPDB::parse
     6.39   374.5854  14509596    0.000026     PDB::Writer::pad_right
     5.54   325.1066  43586508    0.000007     UNIVERSAL::isa
     4.89   286.5572   1881489    0.000152     PDB::Bond::new
     3.49   204.5796  18291657    0.000011     PDB::Atom::x
     3.42   200.8238  18291657    0.000011     PDB::Atom::y
     3.23   189.6586  18291657    0.000010     PDB::Atom::z
     3.14   184.3266   1881489    0.000098     PDB::Bond::isHydphb
     3.14   184.2096   1880381    0.000098     PDB::Bond::isElcsta
     2.24   131.0808         1  131.080769     WhatIf::doWhatif
     1.91   111.7546  10730691    0.000010     PDB::Atom::resName

The code for the first three subroutines, and the two padding helpers they call, is shown here:

    sub notClose {
        my $self = shift;
        my ($a1, $a2, $m) = @_;
        my %hash = %$a2;    # copies the residue hash on every call
        my $cutoff = $lengths{$a1->resName} + 7 + $hash{'r'};
        my $dist = PDB::Bond->dist($a1->x, $hash{'x'}, $a1->y, $hash{'y'}, $a1->z, $hash{'z'});
        my $value = $dist < $cutoff ? 0 : 1;
        return $value;
    }

    sub dist {
        my $self = shift if UNIVERSAL::isa($_[0] => __PACKAGE__);
        my ($x1, $x2, $y1, $y2, $z1, $z2) = @_;
        # the result of sqrt is formatted as a string on every call
        return numberFormat(sqrt( ($x1 - $x2)**2 + ($y1 - $y2)**2 + ($z1 - $z2)**2 ), 1, 2);
    }

    sub numberFormat {
        my ($number, $whole, $frac) = @_;
        return pad_left('0', $whole, '0') . '.' . pad_right('0', $frac, '0') if $number == 0;
        return pad_left($number, $whole, '0') unless $number =~ /\./ || $frac;
        my ($left, $right);
        ($left, $right) = split /\./, $number;
        $left = pad_left($left, $whole, '0');
        if (defined $right) {
            # truncates (does not round) to $frac decimal places
            $right = pad_right(substr($right, 0, $frac), $frac, '0');
            return "$left\.$right";
        }
        else {
            $right = pad_right('0', $frac, '0');
            return "$left\.$right";
        }
    }

    sub pad_left {
        my $self = shift if UNIVERSAL::isa($_[0] => __PACKAGE__);
        my ($item, $size, $padding) = @_;
        my $newItem = $item;
        $padding = ' ' unless defined $padding;
        while (length $newItem < $size) {
            $newItem = "$padding$newItem";
        }
        return $newItem;
    }

    sub pad_right {
        my $self = shift if UNIVERSAL::isa($_[0] => __PACKAGE__);
        my ($item, $size, $padding) = @_;
        my $newItem = $item;
        $padding = ' ' unless defined $padding;
        while (length $newItem < $size) {
            $newItem .= $padding;
        }
        return $newItem;
    }

I had totally forgotten that I use numberFormat to manipulate the result of the sqrt function (this is essential for the DB). I'm now going to move this to the DB part, so that it only gets called when adding 'real' bonds to the DB.

I'm also going to remove the UNIVERSAL::isa calls, and just try to ASSUME $self whenever I can.

Thanks for all the help, and I'm still investigating Mr. Delaunay.

Cheers
Sam
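The plan above, keeping string formatting out of the inner loop, might look roughly like this. It is a sketch under assumptions: the sub names `dist_sq`, `within`, and `format_dist` are hypothetical, points are passed as `[x, y, z]` array refs rather than PDB::Atom objects, and `sprintf` stands in for numberFormat. Note that `sprintf` rounds where the numberFormat shown above truncates, so which behaviour the database needs would have to be checked.

```perl
use strict;
use warnings;

# Squared distance between two points given as [x, y, z] array refs.
sub dist_sq {
    my ($p, $q) = @_;
    my $d2 = ($p->[0] - $q->[0]) ** 2
           + ($p->[1] - $q->[1]) ** 2
           + ($p->[2] - $q->[2]) ** 2;
    return $d2;
}

# Cheap test for the inner loop: compares squared values,
# so it needs neither sqrt nor any string handling.
sub within {
    my ($p, $q, $cutoff) = @_;
    return dist_sq($p, $q) < $cutoff * $cutoff;
}

# Format the distance only when a bond is actually written out.
# sprintf rounds to two decimals; it does not truncate.
sub format_dist {
    my ($p, $q) = @_;
    return sprintf '%.2f', sqrt( dist_sq($p, $q) );
}
```

With this split, the millions of rejected candidates cost three subtractions and a comparison each, and only the pairs that survive pay for `sqrt` and formatting.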
Re^3: Iteration speed by BrowserUk (Patriarch) on Jun 16, 2004 at 20:35 UTC
FWIW, here is the code I mentioned earlier. It just completed a run looking for pairs of atoms that are within .01 units of each other. The input was 1,000,000 atom coordinates, randomly generated, in a file that looks like this:

    atom000001 : 0119.110 0443.939 0146.027
    atom000002 : -217.194 -175.476 -200.134
    atom000003 : 0202.789 -383.911 0183.319
    atom000004 : -459.869 0354.187 -421.509
    atom000005 : 0446.625 0097.504 0243.835

It found 308,822 pairs within the requisite distance of one another (from the 1,000,000,000,000 possibles) in just under 7 hours on a 2 GHz machine. Whether the technique is adaptable to your application I'm not sure.

Read more... (3 kB)
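The code attached to this post is not reproduced here, so the following is not BrowserUk's technique. It is a sketch of one common way to avoid comparing every pair: bucket the points into cubic cells whose side equals the search radius, then compare each point only with points in its own and the 26 adjacent cells. `cell_of` and `close_pairs` are names invented for this example, and points are assumed to be `[x, y, z]` array refs.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Integer cell coordinates of a point for cells of side $r.
# Adding 0 folds a possible negative zero into plain 0, so keys match.
sub cell_of {
    my ($point, $r) = @_;
    return map { floor($_ / $r) + 0 } @$point;
}

# Return all index pairs [$i, $j] (with $i < $j) closer than $r.
sub close_pairs {
    my ($points, $r) = @_;

    # Pass 1: bucket point indices by cell.
    my %cell;
    for my $i (0 .. $#$points) {
        my $key = join ',', cell_of($points->[$i], $r);
        push @{ $cell{$key} }, $i;
    }

    # Pass 2: compare each point with points in neighbouring cells only.
    my $r2 = $r * $r;
    my @pairs;
    for my $i (0 .. $#$points) {
        my ($cx, $cy, $cz) = cell_of($points->[$i], $r);
        for my $dx (-1 .. 1) { for my $dy (-1 .. 1) { for my $dz (-1 .. 1) {
            my $bucket = $cell{ join ',', $cx + $dx, $cy + $dy, $cz + $dz } or next;
            for my $j (@$bucket) {
                next unless $j > $i;    # report each pair once
                my ($p, $q) = @{$points}[$i, $j];
                my $d2 = ($p->[0] - $q->[0]) ** 2
                       + ($p->[1] - $q->[1]) ** 2
                       + ($p->[2] - $q->[2]) ** 2;
                push @pairs, [$i, $j] if $d2 < $r2;
            }
        } } }
    }
    return \@pairs;
}
```

Two points closer than `$r` can differ by at most one cell index along each axis, which is why checking the surrounding 3x3x3 block of cells is enough. How well this suits the bond-detection problem depends on how much the cut-off varies between residue types.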