in reply to Where's the leak?

however it seems that the memory being used to hold this file is not being released, and I get huge build-ups of wasted memory.

You have two problems. The first is a misunderstanding about how Perl uses memory. When you do something that requires more memory than Perl has available, Perl grows its internal memory pool by requesting more memory from the operating system. Perl then allocates internally from this pool. As far as I know, there's no provision at present for returning memory to the operating system. A common trick in long-running Perl applications is to save state in a file, then have the application re-exec itself.
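The save-state-and-re-exec trick can be sketched like this (a minimal illustration; the state file name, the RSS threshold, and the memory probe are all invented for the example, and the /proc probe is Linux-specific):

```perl
use strict;
use warnings;
use Storable qw(store);

my $STATE_FILE = 'app.state';   # hypothetical state file for this sketch

# Read this process's resident set size in KB. Linux-specific;
# other platforms would need a different probe.
sub current_rss_kb {
    open my $fh, '<', '/proc/self/status' or return 0;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)/;
    }
    return 0;
}

# If the process has grown past $rss_limit_kb, persist $state and
# replace this process with a fresh perl running the same script.
sub restart_when_bloated {
    my ($state, $rss_limit_kb) = @_;
    if (current_rss_kb() > $rss_limit_kb) {
        store($state, $STATE_FILE);
        exec($^X, $0, @ARGV) or die "re-exec failed: $!";
    }
}
```

On startup the script would check for the state file, retrieve() it if present, and resume from there.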

The second problem is that you're using "slurp mode" to read the entire file at once. Unless you have a search pattern that extends across multiple lines, you can read, and search, the file line-by-line instead. The additional work this entails might be offset by the lower memory footprint it requires.
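A line-at-a-time search might look like the following sketch (the filename and pattern are placeholders): only one line is held in memory at a time, and the loop stops at the first match.

```perl
use strict;
use warnings;

# Return the line number of the first line matching $pattern,
# or undef if no line matches.
sub search_file {
    my ($path, $pattern) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    while (my $line = <$fh>) {
        return $. if $line =~ $pattern;   # $. is the input line number
    }
    return;
}
```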

Update: If your search pattern does span lines, you might consider the technique described in Matching in huge files.

Replies are listed 'Best First'.
Re: Re: Where's the leak?
by waswas-fng (Curate) on Dec 23, 2002 at 20:56 UTC
    One bad thing about line-by-line in this user's case, though, is that it will be much slower: he is reading these files over the network, and the Windows backend will be far more efficient if he pulls the whole file at once. That said, if memory is more of a concern than speed, line-by-line is the way to go here.

    -Waswas
      One bad thing about line-by-line in this user's case, though, is that it will be much slower: he is reading these files over the network, and the Windows backend will be far more efficient if he pulls the whole file at once.

      Do you have evidence to support this? My experience says the opposite. For one, reading the file in slurp mode doesn't save substantial network traffic over reading it line-at-a-time, since disk pages are read and buffered to support per-line access. For another, assuming the pattern you're trying to match occurs once and is distributed randomly through the target file, on average you'll only need to read half the file to match it.

        I have run into a few projects using C where mmapped files over network mounts on Windows were dropped in favor of a full read of the file, in order to get the low-level Windows networking code to burst the file. In this case, though, your point may be true; if the match happens randomly in the file, there would be no need to have the whole file transferred across the network. In my cases I need access to the whole file every time. A good example of the difference between mmapped file access and a full open/read triggering burst mode is simple, though: copying with Explorer almost always triggers burst mode -- try installing MS Office across a network drive (the installer mmaps the cabs) and time it, then time copying the files across and installing. Or a Perl-only test: slurp and dump to local vs. line-by-line dump to local on a large text file. dws++ for bringing up a point I completely missed, though.

        Edited:
        Also, as far as I know, the bursting mode does not work on Samba servers.

        -Waswas
      Enter buffering. Perl doesn't read the file line by line, even if your code requests it that way.

      Makeshifts last the longest.

        I may be out of date, but doesn't Perl use its PerlIO layer, which just indirectly uses fseek, fwrite and ftell ad nauseam for line-by-line access? Last time I looked, I don't think I saw that it buffered the entire file, which is what you need to do (i.e. read the entire file in one swoop) to get Windows' bursting mode to kick in.

        stdio
        Layer which calls fread, fwrite and fseek/ftell etc. Note that as this is "real" stdio it will ignore any layers beneath it and go straight to the operating system via the C library as usual.

        perlio
        This is a re-implementation of "stdio-like" buffering written as a PerlIO "layer". As such it will call whatever layer is below it for its operations.
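        For what it's worth, you can inspect which layers a handle actually has, and you can bypass the buffered layer entirely with :unix plus a single sysread (a sketch; it reads the running script itself just to have a known file on hand):

```perl
use strict;
use warnings;

# Show the PerlIO layers on an ordinary buffered handle.
open my $fh, '<', $0 or die "open: $!";
my @layers = PerlIO::get_layers($fh);   # often ('unix', 'perlio') or ('unix', 'crlf')
print "layers: @layers\n";

# A true single-read slurp: the raw :unix layer is unbuffered,
# so one sysread pulls the whole file in a single request.
open my $raw, '<:unix', $0 or die "open: $!";
sysread $raw, my $buf, -s $0;
```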

        -Waswas