Smersh2000 has asked for the wisdom of the Perl Monks concerning the following question:

I've run into this problem: I have a script that deals with a decent-size array of arrays, about 160,000 by 300. I load it from a file, do some magic on it (parsing, searching for certain records, and deleting the rows I need), and then write it into another output file. The problem is that, judging from the messages on the screen, it takes a few hours to close the filehandle for one reason or another. What am I doing wrong? Any help is appreciated. TS
open (INFILEHANDLE, "<:crlf", $detail_file_in) or die "Can't open $detail_file_in! \n";
$counter = 0;
while ($line = <INFILEHANDLE>) {
    chomp $line;
    @{$in_array[$counter]} = split($in_delim, $line);
    $counter++;
}
close INFILEHANDLE;

#process the details file
$target = 'details file';
patch_stockfiles(*LOGFILEHANDLE, *FILTERLOGFILEHANDLE, \@in_array, $target);
delete_duplicates(*LOGFILEHANDLE, *FILTERLOGFILEHANDLE, \@in_array, $target);
fix_format(*LOGFILEHANDLE, *FILTERLOGFILEHANDLE, \@in_array, $target);

($sec,$min,$hour,$day,$mon,$year) = (localtime)[0..5];
$date = sprintf "%d%02d%02d", $year+1900, $mon+1, $day;
$preform_date = $date . "\t" . $hour . ":" . $min . ":" . $sec;
print LOGFILEHANDLE $preform_date . "\tWriting Detail File...";
print $preform_date . "\tWriting Detail File...";

#print into a file
open (OUTFILEHANDLE, ">" . $detail_file) or die "Can't open $detail_file \n";
for ($counter = 0; $counter <= $#in_array; $counter++) {
    print OUTFILEHANDLE join($in_delim, @{$in_array[$counter]}) . "\n";
}
@in_array = ();
print LOGFILEHANDLE "Ok\n";
print "Ok\n";
close OUTFILEHANDLE;

Replies are listed 'Best First'.
Re: Perl Filehandle?
by Joost (Canon) on Nov 14, 2006 at 16:09 UTC
Re: Perl Filehandle?
by Fletch (Bishop) on Nov 14, 2006 at 16:04 UTC

    160,000 * 300 is 48 million items (48,000,000). Multiply that by whatever the size of each element is plus Perl's internal overhead and you're talking a good amount of memory. Since you're reading the entire contents of your file into RAM, most likely what's happening is you're spending most of your time waiting for the OS to swap things in and out of RAM. Consider using something like BerkeleyDB or the like to provide random access to your data from an on-disk hash instead.
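    A minimal sketch of that on-disk hash idea using DB_File (which ships with most Perls); the file names, the tab delimiter, and the line-number keys here are made up for illustration, not taken from the original script:

        use strict;
        use warnings;
        use Fcntl;        # for O_RDWR, O_CREAT
        use DB_File;      # ties a hash to an on-disk Berkeley DB file

        # "details.db" and the line-number keys are hypothetical
        tie my %records, 'DB_File', 'details.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie details.db: $!";

        # store one raw line per key instead of holding 48 million fields in RAM
        open my $in, '<:crlf', 'detail_in.txt' or die "Can't open detail_in.txt: $!";
        my $counter = 0;
        while (my $line = <$in>) {
            chomp $line;
            $records{ $counter++ } = $line;
        }
        close $in;

        # later: pull back and split only the rows you actually need
        my @row = split /\t/, $records{42};

        untie %records;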

Re: Perl Filehandle?
by wojtyk (Friar) on Nov 14, 2006 at 17:17 UTC
    Are you sure the slowdown is occurring at the filehandle close and not somewhere else in the program?

    Do all those arrays need to remain in memory for the duration of the program?
    Not knowing what your functions do makes it hard to offer suggestions.

    Have you tried moving patch_stockfiles/delete_duplicates/fix_format/output inside the initial input loop and eliminating the array entirely?
    Based on those subroutine names, it seems as if at least some of that can be done from inside the loop.
    As is, it seems as if you're iterating over that huge array 4 separate times after creating it.

    My suggestion would be to find a way to process this data piecewise (or in tandem), rather than as a whole. Is this possible?

      Thank you for the reply. Originally, that is what I had in mind - work on the files without loading the whole file into memory - but then I thought: given that I need to parse it once for patching, twice for deleting duplicates, and once for fixing the format, wouldn't I lose time on opening/closing files? I think I even tried to do this once in another script, and opening the file alone would take some time. Was I wrong?
        File operations aren't always as black and white as they seem. The operating system does a lot of file caching behind the scenes, so you probably won't take as bad a performance hit as you'd think if you open and close the same file several times.

        Have you tried Tie::File yet?
        It also does caching and deferred writes and other types of optimization.
        More importantly, you can give an upper limit on the amount of memory you want Tie::File to consume, which could possibly prevent excessive swapping.
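        For example, a minimal sketch of Tie::File with a memory cap (the file name and the substitution are placeholders, not your actual patching logic):

            use strict;
            use warnings;
            use Tie::File;

            # "detail_in.txt" is a placeholder; cap Tie::File's cache at roughly 20 MB
            tie my @lines, 'Tie::File', 'detail_in.txt', memory => 20_000_000
                or die "Cannot tie detail_in.txt: $!";

            # each element is one line of the file; assigning to it rewrites that line on disk
            for my $line (@lines) {
                $line =~ s/\r$//;    # example edit only: strip stray carriage returns
            }

            untie @lines;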

        However, that discussion aside...my main point (which I think you might have missed) was that I don't think you have to parse it 4 times. I could be wrong (as I don't know all the facts), but can't any of this be done in tandem? I.e., why can't the format fix be done at the same time as the patch?

        You would be patching and formatting lines (not arrays) of data on the fly. You only need a single pass over all that data, instead of several. You could probably even do the dup checking at the same time. Just build a hash of "things seen" as you're patching/formatting, and skip any dups that appear in the hash. Pseudo-code for what I'm talking about:

        while ( $line = <INFILEHANDLE> ) {
            chomp($line);
            next if $seen{$line}++;    # skip/delete duplicates we've already seen
            patch_line($line);
            format_line($line);
            # ... any other code ...
            print OUTFILEHANDLE "$line\n";
        }