in reply to open - Unbuffered Write???

Now that you know how to control the buffering behavior on an output file, you might want to look for ways to speed up your code in general.

How big are these two files? If just one of them is not too big to fit into memory, load all the data from that one file into an array, then go through the other file line by line and compute all the distances relative to the array elements.

If neither file will fit entirely in memory at one time, you could still speed things up a lot by reading a large number of records from the first file into an array, then for each line of the second file, compute all the distances relative to the current array; then read another chunk of file1 into the array and repeat. The point is to reduce the number of times you have to open and read the contents of the second file. Something like this:

    while ( !eof FIRST ) {
        my $i = 0;
        my ( @first_lats, @first_lons );
        while ( !eof FIRST and $i < 10000 ) {
            $_ = <FIRST>;
            if ( $csv->parse( $_ )) {        # $csv is your Text::CSV_XS object
                my ( $lat, $lon ) = $csv->fields;
                $first_lats[$i] = $lat;
                $first_lons[$i] = $lon;
                $i++;
            }
            else {
                # report csv error
            }
        }
        check_distances( $i, \@first_lats, \@first_lons );
    }

    sub check_distances {
        my ( $n, $flat, $flon ) = @_;
        open SECOND, "second.file" or die $!;
        while (<SECOND>) {
            if ( $csv->parse($_) ) {
                my ( $slat, $slon ) = $csv->fields;
                for ( my $i = 0; $i < $n; $i++ ) {
                    # check distance from $slat,$slon to $$flat[$i],$$flon[$i]
                }
            }
        }
        close SECOND;
    }

(update: fixed problems in array declaration and array indexing -- still not tested, of course)

Another thing that will probably speed it up is to figure out what latitude difference corresponds to 1000 feet; if any two points differ by more than that amount in latitude, you can skip the more complicated lat-lon distance computation.

(Doing the same for longitude is a little trickier, because you have to know in advance the highest value for latitude that you'll ever see in the data, and figure out the longitude distance that equals 1000 feet at that latitude. But if you can do that, it will save some run time.)
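A minimal sketch of that kind of pre-filter, untested; the conversion factor assumes roughly 364,000 feet per degree of latitude, and the $max_lat value is just a placeholder you'd replace with the highest latitude actually present in your data:

    my $feet_per_deg_lat = 364000;                # rough figure -- verify it's close enough for your purposes
    my $lat_cutoff = 1000 / $feet_per_deg_lat;    # about 0.00275 degrees of latitude

    my $max_lat = 45;                             # placeholder: highest latitude you'll see in the data
    my $deg2rad = 3.14159265358979 / 180;
    my $lon_cutoff = 1000 / ( $feet_per_deg_lat * cos( $max_lat * $deg2rad ) );

    # ... then inside the inner loop, before the expensive distance calculation:
    next if abs( $slat - $$flat[$i] ) > $lat_cutoff;
    next if abs( $slon - $$flon[$i] ) > $lon_cutoff;
    # only pairs that survive both quick tests get the full distance computation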

Finally, if your data files are reliably simple -- just two comma-separated numeric values per line -- you might save a lot of run time by just using "split" or regex matching instead of Text::CSV_XS. I'm not completely sure of that, but if this job is running for hours or days, it would be worth a benchmark test to find out, if you haven't done that already.
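If you do run that benchmark, the core Benchmark module makes the comparison easy. Something along these lines (untested; $line is just a made-up sample record):

    use Benchmark qw(cmpthese);
    use Text::CSV_XS;

    my $csv  = Text::CSV_XS->new;
    my $line = "41.878113,-87.629799\n";   # made-up sample record

    cmpthese( -3, {
        csv_xs => sub {
            $csv->parse( $line ) or die "parse failed";
            my ( $lat, $lon ) = $csv->fields;
        },
        split  => sub {
            chomp( my $copy = $line );
            my ( $lat, $lon ) = split /,/, $copy;
        },
    });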

Re^2: open - Unbuffered Write???
by awohld (Hermit) on Aug 04, 2005 at 17:12 UTC
    These files are about 400K each. One file has about 26,000 locations and the other has about 10,800 locations. At the rate it's processing, it'll take me about 3.5 days on my PII 400 MHz.

    I'll have to try your code, and see how much faster it'll run.
      You should not have any trouble holding the full content of a 400KB file in memory. Read one whole file into an array, then run through that array for each line of the other file.
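      A minimal sketch of that approach, untested, reusing the $csv object from the code above (the "first.file" name is just an assumption for the file you slurp):

        open FIRST, "first.file" or die $!;
        my ( @first_lats, @first_lons );
        while (<FIRST>) {
            if ( $csv->parse($_) ) {
                my ( $lat, $lon ) = $csv->fields;
                push @first_lats, $lat;
                push @first_lons, $lon;
            }
        }
        close FIRST;

        open SECOND, "second.file" or die $!;
        while (<SECOND>) {
            next unless $csv->parse($_);
            my ( $slat, $slon ) = $csv->fields;
            for my $i ( 0 .. $#first_lats ) {
                # check distance from $slat,$slon to $first_lats[$i],$first_lons[$i]
            }
        }
        close SECOND;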