in reply to checking for overlap

Thanks for the extra info. Let's talk rather about making the code more readable and maintainable, instead of "optimizing" it. (or maybe that's what you mean by "optimize"?)

First off, your problem with  $rest1[$i] might be due to this:

$i=0; while ($line=<SOLVFILE>) { chop($line); # use "chomp" here, not "chop" ... }
If you're on a system that uses "\r\n" for line termination (i.e. windows), "chop" will only take away the "\n", leaving "\r", whereas "chomp" will take away both. (I'm not really sure that this will fix the problem, but you should use chomp anyway.)

BTW, what value do you see when you print "Number of atoms in solvent file is $Nsolv \n" ?

As for general improvements, many people prefer to feed file names to the script by putting them on the command line with the name of the script:

my_script.pl file1 file2
Then use @ARGV (or just "shift") to do things with the names within the script. Also, when reading from files a line at a time, using $_ instead of a named variable really helps keep it simple. Another thing is that you seem to be declaring variables and assigning file data to them, but never using the data, so you should try to reduce that clutter.

Personally, I would prefer to structure the input data to cut down on the duplicated code -- your two input files seem to be organized the same way, so you could read both files into a single set of hashes, keyed by file name; each hash element would be an array of values to be used from the file (atom names, coords, radii). This means you can loop over the file names, and the code that reads and structures the data is needed in only on place in the script.

(Oh, as I worked out the concept for myself, it looked to me like you have an error in the "sqrt" call -- you use the "z" values twice, and don't use the "y" values at all. Restructuring the code and data storage to reduce the amount of typing will help to avoid such mistakes.)

Finally, proper indentation really does help. Here's one approach -- it parses properly, but I don't have data to test it on:

#!/usr/bin/perl -w use strict; # painful at first, maybe, but worth it # using command line args is good (and easy): die "Usage: $0 solvent.file structure.file\n" unless ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] ); open( PDB, ">Solvent.pdb" ) or die "...:$!\n"; open( TXT, ">OverL.txt" ) or die "...:$!\n"; my %input; my %atmnb; # declare storage near where you need it my %coord; my %natms; for my $file ( @ARGV ) { open( IN, $file ) or die "open failed on $file: $!\n"; while (<IN>) { # use a slice on split to get just the needed data: my ($atb,$x,$y,$z,$rad) = (split)[1,5,6,7,9]; push @{$input{$file}}, $_; # keep full input line push @{$atmnb{$file}}, $atb; push @{$coord{$file}}, [ $x, $y, $z, $rad ]; } close IN; print "Number of atoms in $file is $.\n"; $natms{$file} = $.; } my ($f1,$f2) = @ARGV; for my $i ( 0 .. $natms{$f1}-1 ) { for my $j ( 0 .. $natms{$f2}-1 ) { # use map to compute the x,y,z diffs just once each: my @difs = map { $coord{$f1}[$i][$_] - $coord{$f2}[$j][$_] } ( + 0, 1, 2 ); my $dist = sqrt( $difs[0] * $difs[0] + $difs[1] * $difs[1] + $difs[2] * $difs[2] ); my $rsum = $coord{$f1}[$i][3] + $coord{$f2}[$j][3]; if ( $dist < $rsum ) { print TXT "Overlap for $atmnb{$f1}[$i] and $atmnb{$f2}[$j] +\n"; } else { print PDB "$input{$f1}[$i]"; } } }
update: added the  [$i] [$j] indexes when using the $coord array in the computation/reporting loop. (Shouldn't have forgotten that...)

Another update: regarding optimization, I don't know how much it would save, but in that N1 * N2 nested loop for measuring the distances, you could extract the "f1" data into simpler storage, and save the work of having to dereference the "f1" data structure on every iteration of the inner loop -- ie:

for my $i ( 0 .. $natms{$f1}-1 ) { my $xyz = @{$coord{$f1}[$i]}; my $rad = pop @xyz; for my $j ( 0 .. $natms{$f2}-1 ) { # use map to compute the x,y,z diffs just once each: my @difs = map { $xyz[$_] - $coord{$f2}[$j][$_] } ( 0, 1, 2 ); my $dist = sqrt( $difs[0] * $difs[0] + $difs[1] * $difs[1] + $difs[2] * $difs[2] ); my $rsum = $rad + $coord{$f2}[$j][3]; if ( $dist < $rsum ) { print TXT "Overlap for $atmnb{$f1}[$i] and $atmnb{$f2}[$j] +\n"; } else { print PDB $input{$f1}[$i]; } } }

Replies are listed 'Best First'.
Re^2: checking for overlap ($/)
by tye (Sage) on Aug 08, 2003 at 14:41 UTC
    If you're on a system that uses "\r\n" for line termination (i.e. windows), "chop" will only take away the "\n", leaving "\r", whereas "chomp" will take away both.

    Wrong. If you are Win32, then "\r\n" is used for line termination in files and the record you read in (unless you did binmode) will end with just "\n". $/ starts out as "\n" everywhere. See Re^4: Line Feeds (rumor control) for more on this.

    The reason chomp is preferred to chop for this is because chop will remove the last character of the line even if it isn't a newline.

                    - tye