tekkie has asked for the wisdom of the Perl Monks concerning the following question:

Monks, what you see before you is a simple conglomerate of code designed to run under WinNT 4.0.

The task is simple: the script polls a directory full of text files and scans each one for a given search string. If a file contains the string, the script then identifies any other files that "look like" the given file.

("look-like" defines to "has the same seven character key in the filename")

The problem is this: the script operates on directories containing several hundred files, each ranging in size from a few hundred bytes to a few dozen megabytes. It seems straightforward enough (to me) that each file is opened, checked for the string, then closed; however, it appears that the memory used to hold each file is not being released, and I get huge build-ups of wasted memory.

Is there something in Windows or Perl that I'm forgetting?
Or does the problem lie somewhere other than the file handling?

(note: all IP addresses and paths have been changed to anonymous values for posting purposes)

#!c:\perl -w
use IO::File;
use Shell;
use strict;

my @servers = qw(0.0.0.0 1.1.1.1 2.2.2.2);
my $searchString = join(" ", (@ARGV));

foreach my $server (@servers) {
    my $path    = '\\\SERVER\VOLUME\DIRECTORY\*';
    my $dirPath = '\\\SERVER\VOLUME\DIRECTORY';
    $path    =~ s/SERVER/$server/g;
    $dirPath =~ s/SERVER/$server/g;

    my @dir = dir("$path");
    foreach (@dir) {
        if (/^.*?\s+(\S+\.(\S{3}))$/) {
            unless ($2 eq 'log' || $2 eq 'txt') {
                my $file = join('\\', ($dirPath, $1));
                (my $numSet) = $1 =~ /^.([^.]+)\.\S{3}$/;
                undef $/;
                my $read = new IO::File;
                if ($read->open("< $file")) {
                    if (<$read> =~ /$searchString/g) {
                        print "Found: $file\n";
                        foreach (@dir) {
                            if (/^.*?\s+([A-Z]{1}$numSet\.\S{3})$/) {
                                my $found = join('\\', ($dirPath, $1));
                                print "Found: $found\n";
                            }
                        }
                    }
                }
            }
        }
    }
}

Replies are listed 'Best First'.
Re: Where's the leak?
by dws (Chancellor) on Dec 23, 2002 at 20:51 UTC
    however, it appears that the memory used to hold each file is not being released, and I get huge build-ups of wasted memory.

    You have two problems. The first is a misunderstanding about how Perl uses memory. When you do something that requires more memory than Perl has available, Perl grows its internal memory pool by requesting more memory from the operating system. Perl then allocates internally from this pool. As far as I know, there's no provision at present for returning memory to the operating system. A common trick in long-running Perl applications is to save state in a file, then have the application re-exec itself.
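
    For concreteness, here is a minimal sketch of that save-state/re-exec idea. It assumes, purely for illustration, that the leftover file list is simply passed back to the script on its own command line (a real script might write the state to a file instead):

    use strict;
    use warnings;

    my $BATCH = 100;                          # assumed batch size per process
    my @work  = @ARGV;                        # files still to be processed
    my @batch = splice( @work, 0, $BATCH );   # take the next batch

    process_file($_) for @batch;              # the memory-hungry part

    if (@work) {
        # Restart ourselves on the remainder; the fresh process starts
        # with a fresh, small memory pool, so nothing accumulates.
        exec( $^X, $0, @work ) or die "re-exec failed: $!";
    }

    sub process_file {
        my ($file) = @_;
        # ... open/search $file here ...
    }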

    The second problem is that you're using "slurp mode" to read the entire file at once. Unless you have a search pattern that extends across multiple lines, you can read, and search, the file line by line instead. The additional work this entails might be offset by the lower memory footprint it requires.
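
    A minimal sketch of that line-by-line approach, reusing the $file and $searchString names from the original script, might look like this:

    sub contains_string {
        my ($file, $searchString) = @_;
        local $/ = "\n";                  # make sure we really read by lines
        open my $fh, '<', $file or return 0;
        while ( my $line = <$fh> ) {      # only one line in memory at a time
            if ( $line =~ /$searchString/ ) {
                close $fh;
                return 1;                 # stop at the first hit
            }
        }
        close $fh;
        return 0;
    }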

    Update: If your search pattern does span lines, you might consider the technique described in Matching in huge files.

      One bad thing about line-by-line in this user's case, though, is that it will be much slower: he is reading these files over the network, and the Windows backend will be far more efficient if he pulls the whole file at once. That said, if memory is more of a concern than speed, line-by-line is the way to go here.

      -Waswas
        One bad thing about line-by-line in this user's case, though, is that it will be much slower: he is reading these files over the network, and the Windows backend will be far more efficient if he pulls the whole file at once.

        Do you have evidence to support this? My experience says the opposite. For one, reading the file in slurp mode doesn't save substantial network traffic over reading it line-at-a-time, since disk pages are read and buffered to support per-line access. For another, assuming the pattern you're trying to match occurs once and is distributed randomly through the target file, on average you'll only need to read half the file to match it.

        Enter buffering. Perl doesn't read the file line by line, even if your code requests it that way.

        Makeshifts last the longest.

Re: Where's the leak?
by waswas-fng (Curate) on Dec 23, 2002 at 20:25 UTC
    I don't think you are seeing a _leak_ -- you just seem to be seeing perl trying to be smart about memory usage. perl thinks, "hey, this var needed 12 MB of space last time it was used; I'll keep that much memory around for future use instead of freeing it back to the system, so I don't have to reallocate it again." I bet that if you note the script's memory size at startup as a baseline and then run it against a 15 MB file, you will see that it grabs roughly that much extra memory and holds onto it. Behind the scenes, though, you may notice that the memory is actually unused and swapped out (at least that's what happens on unix) and that only what is currently in use stays resident.

    -Waswas
Re: Where's the leak?
by BrowserUk (Patriarch) on Dec 23, 2002 at 23:23 UTC

    One way of preventing the accumulation of memory that is slightly easier than having to save-state/exec/restore-state is to spin the part of the processing that consumes large chunks off into a separate process. When that process terminates, its memory is returned to the OS for re-use.

    In your case, as you only wish to know if the file contains the string, you put a line something like this at the top of your program...

    $cmd = q[ perl -MIO::File -we "exit( do{ local $/; my $io=new IO::File; $io->open( $ARGV[0] ); <$io> } =~ /$ARGV[1]/ )" ];

    If you're using unix, you could probably split that over several lines, using single quotes instead of doubles.

    Then replace these 4 lines...

    undef $/;
    my $read = new IO::File;
    if($read->open("< $file")) {
        if(<$read> =~ /$searchString/g) {

    with...

    if ( system( $cmd, $file, $searchString ) ) {

    Note: You will need to adjust the $cmd string to suit your OS. There are also many ways to improve it.

    For instance, you could do as dws suggested and process the file one line at a time rather than slurping it, and bail out as soon as you find the search string.

    You'll notice that I have removed the /g option from the match. There is no point in looking for more than one occurrence unless you are going to do something with the knowledge.

    Also, as coded above, any failure in the script will be seen as a successful search. You should decode the return value from system, separating the return from perl itself from that of the exit in the one-liner. Or you could use the C-style double-negative test: exit( !... ) in the one-liner and if ( !system(...) ) { in the main script.
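
    As a rough sketch of the decoding option (reusing $cmd, $file and $searchString from above, and assuming the one-liner as written, where exit(1) means the pattern was found):

    my $rc = system( $cmd, $file, $searchString );
    if ( $rc == -1 ) {
        warn "could not launch perl: $!\n";             # the child never ran
    }
    elsif ( $rc & 127 ) {
        warn "child died with signal ", ( $rc & 127 ), "\n";
    }
    elsif ( ( $rc >> 8 ) == 1 ) {                       # the one-liner's exit(1)
        print "Found: $file\n";
    }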

    This way, your maximum memory usage should be roughly two perl processes plus the biggest file you process.


    Examine what is said, not who speaks.

      A good suggestion, but it's begging for some decoupling.
      sub grepfile {
          my ($rx, $file) = @_;
          my $ret = system(
              qw/ perl -0777 /,
              -e => q{$a = shift; exit (<> !~ /$a/)},
              $rx, $file,
          );
          return $ret == 0 if $ret != -1;
          require Carp;
          Carp::croak "Failed invoking perl for grepfile(): $!";
      }
      Note how using system(LIST) eliminates any quoting headaches in one easy step.
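
      For illustration, dropping it into the original loop might look something like this (variable names borrowed from the OP's script):

      if ( grepfile( $searchString, $file ) ) {
          print "Found: $file\n";
          # ... then scan @dir for the look-alike files as before ...
      }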

      Makeshifts last the longest.

        Yup! That makes it much easier, and covers most of the "many ways to improve it" I hinted at, though I'm surprised you didn't throw in -n and exit early. Still, it is much clearer how to add that with your version.


        Examine what is said, not who speaks.

Re: Where's the leak?
by tekkie (Beadle) on Dec 23, 2002 at 20:49 UTC
    I've added in the $read->close; as such:
    if(<$read> =~ /$searchString/g) {
        ...
        $read->close;
    }
    and the problem persists... is there a way to convince Perl not to remember the amount of memory the vars used each time?
      It isn't so much that Perl remembers how much memory a variable used; it's that Perl spent all that effort allocating it earlier and isn't about to give it up just yet, since in general there is a high probability that if a lot of memory was needed once, it will be needed again.
Re: Where's the leak?
by tekkie (Beadle) on Dec 26, 2002 at 13:35 UTC
    Thanks to all for a lot of invaluable knowledge.

    In the end, I applied a variation on dws' sliding-window technique from "Matching in huge files". It has proven to completely eliminate the memory usage issue.

    The solution (a.k.a. 'The Way Tekkie Did It'):

    The addition of the following search subroutine:
    sub search {
        my ($file, $searchString) = @_;
        local *F;
        open(F, "<", $file) or return 0;
        binmode(F);

        # Prime the window with two blocks, then slide forward one block
        # at a time so a match spanning a block boundary is still found.
        my $nbytes = read(F, my $window, 2 * BLOCKSIZE);
        while ( $nbytes > 0 ) {
            if ( $window =~ m/$searchString/ ) {
                close(F);
                return 1;
            }
            $nbytes = read(F, my $block, BLOCKSIZE);
            last if $nbytes == 0;
            substr($window, 0, BLOCKSIZE) = '';   # drop the oldest block...
            $window .= $block;                    # ...and append the new one
        }
        close(F);
        return 0;
    }
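
    (For the above to compile, BLOCKSIZE has to be defined somewhere; the value below is only a plausible guess, not taken from the original node.)

    use constant BLOCKSIZE => 8192;   # assumed block size -- tune to taste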

    Coupled with a modification to the main loop here:
    if(search($file, $searchString)) {
        print "\tFound: $file\n";
        foreach(@dir) {
            if(/^.*?\s+([A-Z]{1}$numSet\.\S{3})$/) {
                my $found = join('\\', ($dirPath, $1));
                print "\tFound: $found\n";
            }
        }
    }
    And you've got another problem solved by the wisdom of the monks.
    Many thanks once more to all who offered their assistance.

Re: Where's the leak?
by waswas-fng (Curate) on Dec 23, 2002 at 20:29 UTC
    Also, you may want to call $read->close; after you are done with each file.

    -Waswas