Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there I have a text file that looks like this ..
-112.6 115.0 -120.8 153.7 -121.8 118.7 -118.5 142.8 -144.1 146.9 -119.7 124.5 -118.0 119.2
And I am writing a perl script to take each pair of values, place them in an array element according to whether they fit into a certain range ...for example, Phi (first value) between -180 and -175 and Psi (second value) between 60 and 70.
Thanks to all the help I have received from people on here recently, I now seem to have at least a semi working piece of code, as shown below:
#!/usr/bin/perl use warnings; # from command line. This file is a text file of phi psi angles as sel +ected # from ramachandran_3DplotA.pl script $input = $ARGV[0]; $output = $ARGV[1]; $n = 0; $o = 0; $v = 0; if(($ARGV[0] eq "-h") || ($ARGV[0] eq "")) { print "\n* Ramachandran_3D Plot* program\n"; print "-------------------------------------------\n"; print "Usage: perl ramachandran_3DplotB.pl <input file> <output fi +le>\n"; print "-------------------------------------------\n"; print "Prints out file for use in Graphis plotting program\n\n"; exit(1); } open(FILE, "$input") || die "ERROR: Unable to open $input FILE for rea +ding: $!\n"; open(OUT, ">>$output") || die "ERROR: Unable to open $output FILE for +writing: $!\n"; @file = <FILE>; for($i = -180; $i < 181; $i = $i+5) { $j = $i+5; $n++; $o = 0; for($k = -180; $k < 181; $k = $k+5) { $m = $k+5; $o++; @ranges = ([$i, $j, $k, $m]); @range_counts = (0) x @ranges; for $R (0..$#ranges) { for($f = 0; $f <@file; $f++) { $phi = substr($file[$f], 0, 6); $psi = substr($file[$f], 8, 6); print "$phi $psi\n"; if ($phi >=$ranges[$R][0] and $phi < $ranges[$R][1] and $psi >= $ranges[$R][2] and $psi < $ranges[$R][3]) { ++$range_counts[$R]; } } } for $R (0..$#ranges) { if($v < 73) { printf OUT "%d, ", $range_counts[$R]; $v++; } else { print OUT "\n"; $v = 0; } } } }
This script now works just fine for a reasonably small test file. But it doesnt fare so well on my actual file (which is about 22MB in size) .. in fact it just sits there and does nothing. I'm not sure quite why this is, would it be because the script as is is just terribly inefficient, or would it be that it simply cant handle large quantities of data, or what?
Any advice much appreciated

Replies are listed 'Best First'.
Re: inefficient code? works on small files ok but not larger ones
by dragonchild (Archbishop) on Oct 05, 2004 at 14:44 UTC
    Your problem is that you're iterating through the file over and over and over. You go through each line in the file N times, where N is the number of ranges. And you do that 72 times. So, your runtime is 72 * Ranges * lines_in_file. Oops!

    Plus, the line @file = <FILE>; puts the entire file into RAM. A 22MB file translates to, probably, 100MB - 200MB in RAM. Add the overhead of Perl and anything else you're doing and you could be using 300MB+ of RAM. If you're doing this on a smaller machine, you could be swapping memory like crazy, which will kill your program's runtime. (And, it might even kill the machine.)

    A much better idea would be to pre-process the file, maybe by loading it into a database. Then, do your work against that. Another idea would be to sort the file. That way, you can short-circuit your processing.

    The important thing is that you have a program that works. Optimizing for speed once you have accuracy down is good. Optimizing for accuracy once you have speed down is ... harder.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: inefficient code? works on small files ok but not larger ones
by davido (Cardinal) on Oct 05, 2004 at 14:46 UTC

    This is a good example of what people mean when they say, "That solution doesn't scale well.

    The problem is that the algorithm you've devised to partition the data by range runs in something like O(n^3) time (or worse), which means that as your dataset grows by n elements, the work being done grows by at least n^3 (again, it might be worse in final analysis). You have a construct of four 'for' loops nested, and then two 'for' loops nested later.

    As the dataset grows, you go from a relatively small number of iterations, to billions of iterations, rather quickly, and the program appears to grind to a halt -- in reality it's working really really hard.

    Couldn't you predeclare your ranges, and then in a single loop test which ranges each line of data fits into?

    Your next problem is that you're slurping the entire file. That means that you must hold the entire 22mb file in memory. Plus, the array you're generating must fit in memory too. You're probably getting to a point where your operating system cries uncle and starts using hard-disk swap space. With a lot of swapfile churning, you also get dreadfully slow performance. It would work a lot better if you could read your data in line by line and process it, one line at a time.


    Dave

Re: inefficient code? works on small files ok but not larger ones
by tall_man (Parson) on Oct 05, 2004 at 15:57 UTC
    Given the simplicity and regularity of your ranges, you could simply compute indices of a two-dimensional array instead, like this:
    use strict; use POSIX qw(floor); my $input = $ARGV[0]; open(FILE, $input) || die "ERROR: Unable to open $input FILE for readi +ng: $!\n"; my @range2; my ($phi, $psi); while (<FILE>) { $phi = substr($_, 0, 6); $psi = substr($_, 8, 6); print "$phi $psi\n"; my $i = floor($phi/5.0) + 36; my $j = floor($psi/5.0) + 36; if ($i >= 0 && $j >= 0 && $i < 73 && $j < 73) { ++$range2[$i][$j]; } } for my $i (0..72) { for my $j (0..72) { my $val = $range2[$i][$j]; $val ||= 0; print "$val, "; } print "\n"; }
      I agree -- this sort of approach makes a lot more sense, and will scale very well to any amount of input. I'd just make one minor adjustment:
      while (<FILE>) { # $phi = substr($_, 0, 6); # $psi = substr($_, 8, 6); ( $phi, $psi ) = split; ...
      The OP didn't show us any lines of data with single-digit numbers in one or both columns. Such lines might or might not have fixed-width fields, and if not, substr() would do the wrong thing; split() would always do the right thing.

      (For that matter, it's not clear whether the column separator is two spaces or a single tab character -- again, split works in any case, while substr would get tripped up when you guess wrong about that.)

Re: inefficient code? works on small files ok but not larger ones
by duff (Parson) on Oct 05, 2004 at 14:47 UTC

    Firstly, you slurp the entire file into memory with the line that reads @file = <FILE>;. Secondly, you loop over each line of the file many many times. Also, not relating to your problem per se, but you are using an array that contains a single array reference and it doesn't look like there's any reason for that (@ranges will always have one element, the reference to your 4 values. Similar for @range_counts since it's built from @ranges).

Re: inefficient code? works on small files ok but not larger ones
by jaco (Pilgrim) on Oct 05, 2004 at 14:39 UTC
    well, depending on what the specs of your machine is. I'd say that
    @file = <FILE>;
    could be part of the problem. Especially if you have limited RAM.
Re: inefficient code? works on small files ok but not larger ones
by mutated (Monk) on Oct 05, 2004 at 16:11 UTC
    The code would be easier to read ( and thus easier to comment on and make more efficent ) if you changed a couple things, first try using a little more descriptive names than i,j,k,m,n and o. Also as it doesn't appear that you modify the value of i/j or k/m within the context of one loop just get rid of j and m refer to them instead as (i+5) and (m+5). I think also changing for $R (0..$#ranges) to foreach $range (i,i+5,m,m+5) it wont make your code run any faster, as others have noted above, reading in a 22MB file into an array, and itterating through that array that many times just isn't a good thing, but it's a good place to start.


    daN.