ioana has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I want to check for overlap of two files.
In both files the lines look this way:
ATOM 1 K+ K+ 1 27.261 7.934 32.141 1.0000 1.000
ATOM 1 N ALA 1 -12.149 -11.899 13.144 0.1414 1.550
Columns 6-8 contain the x.y,z coordinates, and the last column
contains the radius. I want to remove from <SOLVFILE> all the entries that
overlap with the <STRUCTFILE>. The overlap criteria is that
the distance between entries in SOLVFILE and entries in STRUCTFILE
should be larger than the sum of the radii. If d<R1+R2 consider an overlap and remove the line from the <SOLVFILE>
Here is the code
#!/usr/bin/perl -w @desBooli = (); @ATOM = (); @atmnb1 = (); @atmname1 = (); @resname1 = (); @resnb1 = (); @rest1 = (); @x1 = (); @y1 = (); @z1 = (); @q1 = (); @rad1 = (); @atmnb2 = (); @atmname2 = (); @resname2 = (); @resnb2 = (); @x2 = (); @y2 = (); @z2 = (); @q2 = (); @rad2 = (); print "Please type the filename with the solvent: "; $file1 = <STDIN>; chomp $file1; #Does the file exist? unless ( -e $file1) { print "File \"$file1\" doesn\'t seem to exist!!\n"; exit; } #Can we open the file? open(SOLVFILE, $file1) || die("Couldn't open file $file1\n"); print "Please type the filename with the structure: "; $file2 = <STDIN>; chomp $file2; #Does the file exist? unless ( -e $file2) { print "File \"$file2\" doesn\'t seem to exist!!\n"; exit; } #Can we open the file? open(STRUCTFILE, $file2) || die ("Couldn't open file $file2\n"); open(test1, ">Solvent.pdb"); open(test2, ">OverL.txt"); $i=0; while ($line=<SOLVFILE>) { chop($line); $rest1[$i] = $line; ($ATOM, $atomnb1[$i], $atmname1[$i],$resname1[$i],$resnb1[$i],$x1[$i], +$y1[$i],$z1[$i],$q1[$i],$rad1[$i])= split(/\s+/, $line); $i++; } $Nsolv = $i; print "Number of atoms in solvent file is $Nsolv \n"; $i = 0; while ($line=<STRUCTFILE>) { chop($line); ($ATOM, $atomnb2[$i], $atmname2[$i],$resname2[$i],$resnb2[$i],$x2[$i], +$y2[$i],$z2[$i],$q2[$i],$rad2[$i])= split(/\s+/, $line); $i++; } $Nstruct = $i; print "Number of atoms in structure file is $Nstruct \n"; for ($i = 0; $i <=$Nsolv-1; $i++) { $desBooli[$i] = 0; } for ($i = 0; $i <=$Nsolv-1; $i++) { for ($j = 0; $j <=$Nstruct-1; $j++) { $dist = sqrt(($x1[$i]-$x2[$j])*($x1[$i]-$x2[$j])+($z1[$i]-$z2[$j])*($z +1[$i]-$z2[$j])+($z1[$i]-$z2[$j])*($z1[$i]-$z2[$j])); $radSum = $rad1[$i] +$rad2[$j]; if($dist < $radSum) { print test2 "Overlap for atoms $atomnb1[$i] and $atomnb2[$j]\n"; $desBooli[$i] = 1; } } } $i = 0; for ($i = 0; $i <=$Nsolv-1; $i++) { if ($desBooli[$i] == 0){ print test1 "$rest1[$i]\n" } } close STRUCTFILE; close SOLVFILE; close test1; close test2; exit;

Replies are listed 'Best First'.
Re: checking for overlap
by graff (Chancellor) on Aug 08, 2003 at 07:15 UTC
    Thanks for the extra info. Let's talk rather about making the code more readable and maintainable, instead of "optimizing" it. (or maybe that's what you mean by "optimize"?)

    First off, your problem with  $rest1[$i] might be due to this:

    $i=0; while ($line=<SOLVFILE>) { chop($line); # use "chomp" here, not "chop" ... }
    If you're on a system that uses "\r\n" for line termination (i.e. windows), "chop" will only take away the "\n", leaving "\r", whereas "chomp" will take away both. (I'm not really sure that this will fix the problem, but you should use chomp anyway.)

    BTW, what value do you see when you print "Number of atoms in solvent file is $Nsolv \n" ?

    As for general improvements, many people prefer to feed file names to the script by putting them on the command line with the name of the script:

    my_script.pl file1 file2
    Then use @ARGV (or just "shift") to do things with the names within the script. Also, when reading from files a line at a time, using $_ instead of a named variable really helps keep it simple. Another thing is that you seem to be declaring variables and assigning file data to them, but never using the data, so you should try to reduce that clutter.

    Personally, I would prefer to structure the input data to cut down on the duplicated code -- your two input files seem to be organized the same way, so you could read both files into a single set of hashes, keyed by file name; each hash element would be an array of values to be used from the file (atom names, coords, radii). This means you can loop over the file names, and the code that reads and structures the data is needed in only on place in the script.

    (Oh, as I worked out the concept for myself, it looked to me like you have an error in the "sqrt" call -- you use the "z" values twice, and don't use the "y" values at all. Restructuring the code and data storage to reduce the amount of typing will help to avoid such mistakes.)

    Finally, proper indentation really does help. Here's one approach -- it parses properly, but I don't have data to test it on:

    #!/usr/bin/perl -w use strict; # painful at first, maybe, but worth it # using command line args is good (and easy): die "Usage: $0 solvent.file structure.file\n" unless ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] ); open( PDB, ">Solvent.pdb" ) or die "...:$!\n"; open( TXT, ">OverL.txt" ) or die "...:$!\n"; my %input; my %atmnb; # declare storage near where you need it my %coord; my %natms; for my $file ( @ARGV ) { open( IN, $file ) or die "open failed on $file: $!\n"; while (<IN>) { # use a slice on split to get just the needed data: my ($atb,$x,$y,$z,$rad) = (split)[1,5,6,7,9]; push @{$input{$file}}, $_; # keep full input line push @{$atmnb{$file}}, $atb; push @{$coord{$file}}, [ $x, $y, $z, $rad ]; } close IN; print "Number of atoms in $file is $.\n"; $natms{$file} = $.; } my ($f1,$f2) = @ARGV; for my $i ( 0 .. $natms{$f1}-1 ) { for my $j ( 0 .. $natms{$f2}-1 ) { # use map to compute the x,y,z diffs just once each: my @difs = map { $coord{$f1}[$i][$_] - $coord{$f2}[$j][$_] } ( + 0, 1, 2 ); my $dist = sqrt( $difs[0] * $difs[0] + $difs[1] * $difs[1] + $difs[2] * $difs[2] ); my $rsum = $coord{$f1}[$i][3] + $coord{$f2}[$j][3]; if ( $dist < $rsum ) { print TXT "Overlap for $atmnb{$f1}[$i] and $atmnb{$f2}[$j] +\n"; } else { print PDB "$input{$f1}[$i]"; } } }
    update: added the  [$i] [$j] indexes when using the $coord array in the computation/reporting loop. (Shouldn't have forgotten that...)

    Another update: regarding optimization, I don't know how much it would save, but in that N1 * N2 nested loop for measuring the distances, you could extract the "f1" data into simpler storage, and save the work of having to dereference the "f1" data structure on every iteration of the inner loop -- ie:

    for my $i ( 0 .. $natms{$f1}-1 ) { my $xyz = @{$coord{$f1}[$i]}; my $rad = pop @xyz; for my $j ( 0 .. $natms{$f2}-1 ) { # use map to compute the x,y,z diffs just once each: my @difs = map { $xyz[$_] - $coord{$f2}[$j][$_] } ( 0, 1, 2 ); my $dist = sqrt( $difs[0] * $difs[0] + $difs[1] * $difs[1] + $difs[2] * $difs[2] ); my $rsum = $rad + $coord{$f2}[$j][3]; if ( $dist < $rsum ) { print TXT "Overlap for $atmnb{$f1}[$i] and $atmnb{$f2}[$j] +\n"; } else { print PDB $input{$f1}[$i]; } } }
      If you're on a system that uses "\r\n" for line termination (i.e. windows), "chop" will only take away the "\n", leaving "\r", whereas "chomp" will take away both.

      Wrong. If you are Win32, then "\r\n" is used for line termination in files and the record you read in (unless you did binmode) will end with just "\n". $/ starts out as "\n" everywhere. See Re^4: Line Feeds (rumor control) for more on this.

      The reason chomp is preferred to chop for this is because chop will remove the last character of the line even if it isn't a newline.

                      - tye
Re: checking for overlap
by leriksen (Curate) on Aug 08, 2003 at 07:32 UTC
    I hard-coded the input file names for my own convenience while testing.

    somewhat optimised - I like to think of it as a good mix between obvious and efficient.

    #!/usr/bin/perl -w use strict; my @solvents; my @structs; my @fields = qw(atom number name resname resnumber xpos ypos zpos q ra +dius); slurp_file(\@solvents, "solv.txt"); slurp_file(\@structs, "struct.txt"); print "Atoms in solvent file is ", scalar(@solvents), "\n"; print "Atoms in structure file is ", scalar(@structs), "\n"; my @desBooli = map {0} @solvents; # could be done other ways open(test2, ">OverL.txt") or die ("Couldn't open file OverL.txt :: $!\n"); for (my $i = 0; $i < @solvents; $i++) { for (my $j = 0; $j < @structs; $j++) { # I am sure there is a better way my $dist = sqrt(($solvents[$i]->{xpos} - $structs[$j]->{xpos})**2+ ($solvents[$i]->{ypos} - $structs[$j]->{ypos})**2+ ($solvents[$i]->{zpos} - $structs[$j]->{zpos})**2) +; my $radSum = $solvents[$i]->{radius} + $structs[$j]->{radius}; if($dist < $radSum) { print test2 "$i - Overlap for atoms solvent $solvents[$i]->{resn +umber} and structure $structs[$j]->{resnumber}\n"; $desBooli[$i] = 1; } } } close test2; open(test1, ">Solvent.pdb") or die ("Couldn't open file Solvent.pdb :: $!\n"); for (my $i = 0; $i < @solvents; $i++) { if ($desBooli[$i] == 0){ # no overlap print test1 dump_solvent($solvents[$i]), "\n"; } } close test1; exit; sub slurp_file { my ($array, $name) = @_; open(FILE, $name) or die ("Couldn't open file $name:: $!\n"); while (<FILE>) { chomp; # trim line endings next if /^\s*$/; # strip blank lines my %hash = (); @hash{@fields} = split /\s+/; # hash slice push @{$array}, \%hash; } use Data::Dumper; close FILE; } sub dump_solvent { my ($solvent) = @_; join ' ', @{$solvent}{@fields}; }
Re: checking for overlap (index)
by tye (Sage) on Aug 08, 2003 at 18:19 UTC

    I'm a bit reluctant to go into this because implementing this would be a rather complex programming challenge. But I'll explain it as best I can in hopes that someone will find it helpful, educational, or at least interesting. (:

    When I previously had to find overlaps between two sets of shapes, I certainly couldn't wait to compare against every single shape in the database. So I had to come up with an index on the table that would allow me to quickly eliminate almost all of the items in the database as not being nearly close enough to have to worry about.

    The problem with a database index is that it is one-dimensional.

    I had previously had to deal with finding the nearest point (not overlapping shapes) and for that I really needed a two-dimensional key. But after some work I figured out that I could do pretty good by sorting the points into parallel "bands" that would define a linear (one-dimensional) ordering of all the points and allow me to build an index.

    So, if I made the bands W units thick and horizontal and wanted to find all point within 2*W units of some point, then I might end up doing a keyed search for records falling into the following 5 sections of bands:

    ---------------------------- [ ] ---------------------------- [ ] ---------------------------- [ . ] ---------------------------- [ ] ---------------------------- [ ] ----------------------------
    For each record, I built a key composed of int(Y/W) then X and then indexed on that compound key.

    For finding overlapping shapes, things get a bit more complicated. You can key each shape based on it's center. But you'd have to be able to assert that each shape is no larger than some maximum size or else you couldn't restrict your search.

    Based on your problem description, I think you might be able to just figure out your maximum radius and index your points into "shafts" and have a pretty fast solution. That is, for each point in the smaller set (or in the larger set if this set is used over and over with few changes against a bunch of different smaller sets) you build a key from the following values (in the order given):

    int(X/W), int(Y/W), Z
    where W is the width of your shafts.

    Then, if you need to check a sphere centered at (X1,Y1,Z1) and with radius R1, you need to look for all spheres in the database where their center is within R1+M of (X1,Y1,Z1). So loop over all int(X/W) between X1-R1-M and X1+R1+M:

    for my $xBand ( int(($X1-$R1-$M)/$W) .. int(($X1-$R1-$M)/$W) ) {
    Inside there, loop over all possible Y band coordinates. For simplicity you can just use:
    for my $yBand ( int(($Y1-$R1-$M)/W) .. int(($Y1-$R1-$M)/$W) ) {
    or you can do a little math and narrow down that range just a bit -- in a shaft that is already pretty far from (X1,Y1,Z1), you don't have to worry about Z being all the way out at Z-R1-M or Z+R1+M since that would be farther than R1+M from (X1,Y1,Z1). But we'll keep it simple in hopes of avoiding some bugs in what will already be fairly complex code.

    Now you are looping over all shafts that contain centers within R1+M of the current center of interest. So, for each shaft, look for centers in the range. We'll again not optimize this for the sake of simplicity:

    my $start= BuildKey( $xBand, $yBand, $Z1-$R1-$M ); my $end= BuildKey( $xBand, $yBand, $Z1+$R1+$M ); ... "select * from shapes where key between $start and $end" my $record; while( $record= Fetch( ... ) ) { if( Overlap( $record, ... ) ) { Report( ... ); return 1; } } } } return 0;

    But, what if you can't pick a reasonable maximum radius? Well, you just segregate shapes based on what range their radius is in and add a loop around all of the above code that iterates over each radius range.

    For example, you might realize that most of your radii are <= 2, a fraction of them are between 2 and 10, and a very few are over 10. In this case you might build a key as follows:

    my @bandWidth= ( 3, 15, 0 ); sub BuildKey { my( $radius, $x, $y, $z )= @_; my $sizeRange= $radius <= 2 ? 0 : $radius <= 10 ? 1 : 2; my $width= $bandWidth[$sizeRange]; if( 0 == $width ) { return ConcatForKey( $sizeRange, 0, 0, 0 ); } return ConcatForKey( $sizeRange, int($x/$width), int($y/$width), $z ); }
    Exactly what ConcatForKey() does as well as how to select and Fetch() depends on what you want to use for a database. You could use a standard SQL database and have records that look something like:
    real x real y real z real radius int size int xBand int yBand index size,xBand,yBand,z
    where you'll have to pick exactly which datatypes to use such that all of your points fit.

    Or you chould use a DBM file in which case you (pretty much) need to express each key as a string (of bytes) such that normal "ASCIIbetical" ordering (as done by cmp) sorts properly.

    Assuming that your X and Y coordinates never go beyond -2147483647*$Width nor 2147483647*$Width, then you can put big-endian longs into your key string and get a proper sort order. However, you have to shift the values so that negative numbers sort before positive numbers. We'll even put the Z values into "bands" so we can use the same strategy for putting them in the key that we do for the X and Y band numbers:

    sub BuildKey { my( $radius, $x, $y, $z )= @_; my $sizeRange= $radius <= 2 ? 0 : $radius <= 10 ? 1 : 2; my $width= $bandWidth[$sizeRange]; if( 0 == $width ) { return pack "c", $sizeRange; } for my $c ( $x, $y, $z ) { $c= int($c/$width); $c ^= 0x8000_0000; } return return pack "cNNN", $sizeRange, $x, $y, $z; }
    Note that this assumes 4-byte integers. If you have a version of Perl built with 8-byte integer support, then you'll probably have to change the ^= line. If you want to use 8-byte integers in your keys, then you'll also have to adjust the pack expression.
    Once you have this ASCIIbetical key, you can also decide to just hold the records in memory and, instead of letting a database use b-trees (or similar) to find interesting records, just use a binary search yourself. And if you go that route, it then becomes pretty easy to move much of the searching code into Inline::C to make it quite a bit faster (since it does quite a bit of fairly simple operations including arithmatic).

                    - tye
Re: checking for overlap
by ioana (Initiate) on Aug 08, 2003 at 05:34 UTC
    Questions:
    1. At line 58 I read  $rest1[$i] = $line;
    but I do not seem to read it correctly as I can't print it out
    in the final "for" loop. What is the correct way to do it?
    2. Can anyone help me optimize this code?
    Thanks,
    Ioana

      I don't see anything specifically wrong with line 58 and the final for loop. However I did find a serious typo, in your code to find the distance between the two atoms, you accidentally used the Z co-ordinate twice instead of Y. I'm not able to help you optimize the algorithm, but I did clean up the code so it is much more readable (and runs under strict). Allthough I did use 'constant' so all of the caveats for it apply. Be careful of PM wrapping this code. Oh and since you only provided one example line I tested this against some randomly generated data which may not have been correct :).

      #!/usr/bin/perl -w use strict; use constant { RAD => 9, X => 5, Y => 6, Z => 7, NB => 1, REST => 10, }; my $file; print "Please type the filename with the solvent: "; chomp($file = <STDIN>); die "File \"$file\" doesn\'t seem to exist!!\n" unless -e $file; open(SOLVFILE, $file) or die("Couldn't open file $file\n"); print "Please type the filename with the structure: \n"; chomp($file = <STDIN>); die "File \"$file\" doesn\'t seem to exist!!\n" unless -e $file; open(STRUCTFILE, $file) or die("Couldn't open file $file\n"); open(PDB, ">Solvent.pdb") or die "$!"; open(OVERL, ">OverL.txt") or die "$!"; my @solv; my $line; while ($line=<SOLVFILE>) { chomp $line; push @solv, [split /\s+/, $line]; push @{$solv[-1]}, $line; } close SOLVFILE; print "Number of atoms in solvent file is ",$#solv+1," \n"; my @struct; while ($line=<STRUCTFILE>) { chomp($line); push @struct, [split /\s+/, $line]; push @{$struct[-1]}, $line; } close STRUCTFILE; print "Number of atoms in structure file is ",$#struct+1," \n"; for (my $i = 0; $i <= $#solv; $i++) { my $v = $solv[$i]; for my $t (@struct) { if ( # Sum of radius is greater than distance # between the atoms ($v->[RAD] + $t->[RAD]) > sqrt( ($v->[X] - $t->[X])**2 + ($v->[Y] - $t->[Y])**2 + ($v->[Z] - $t->[Z])**2 )) { print OVERL "Overlap for atoms ", $v->[NB]," and ",$t->[NB],"\n"; } else { print PDB $v->[REST],"\n"; } } } close OVERL; close PDB; exit;

      HTH

      tedrek

      Update: I realized the whole @overlap functionality could be moved into the main loop, so I did :). Also you could change the outer for loop to to a while(<SOLV>) and not need to read SOLV into memory.

      Update2: After seeing graff's code I realized the print to PDB was supposed to be in a else block.

Re: checking for overlap
by graff (Chancellor) on Aug 08, 2003 at 05:27 UTC
    And um... what is the question/problem?