Pearte has asked for the wisdom of the Perl Monks concerning the following question:

Probably an easy one ...

I have a piece of code that hashes a directory structure using the File::Recurse module. It looks like ...
%files = Recurse([$drive], {match => '\.'});
... and then I can shuffle through it like so ...
foreach $dir (sort keys %files) {
    foreach $file (@{ $files{$dir} }) {
        # Do work
    }
}
The trouble is that there are literally millions of files in the hash and, as you can imagine, that takes up a lot of memory ... don't ask. The process is able to hash the entire structure without any trouble, but somewhere in the program's loop I get "Out of memory" errors. Undoubtedly this is the result of the aforementioned hash. I don't need to keep any element of that hash around after I have used it, so my question is: how do I delete entries from the hash one by one? In other words, how do I deallocate the memory of an individual hash entry, and is that even possible? I hope this would let me avoid the memory errors, though I fear it might come at the cost of a great deal of memory-management overhead.

Between a rock and a hard place,
Pearte

PS - If anyone has a better strategy for dealing with this many files, I'd like to hear it.

Replies are listed 'Best First'.
Re: Hash Entry Deallocation
by chromatic (Archbishop) on Jun 30, 2003 at 17:49 UTC

    Guessing at your data structure, you could use a line like this right after the inner loop:

    delete $files{$dir};
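
    A minimal sketch of that placement, using the loop from the question (the "Do work" comment stands in for the real processing):

        foreach my $dir (sort keys %files) {
            foreach my $file (@{ $files{$dir} }) {
                # Do work on this $file
            }
            delete $files{$dir};    # drop this directory's entry once it's processed
        }

    Deleting inside the outer loop is safe here because sort keys %files has already flattened the keys into a list before iteration begins.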

    It might work better if you were to use a tied hash that stored its contents on disk instead of in memory. That presumes you have enough disk space to hold your entries.
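
    A rough sketch of what that could look like, assuming DB_File (bundled with Perl where Berkeley DB is available) and a file name, files.db, made up for the example; a serializing layer such as MLDBM would be needed on top, since a plain DBM file stores only flat strings and this hash holds array references:

        use DB_File;
        use Fcntl;

        my %files;
        tie %files, 'DB_File', 'files.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie files.db: $!";

        # ... populate and read %files as usual; entries now live on disk,
        # and delete $files{$dir} removes them from the file as well

        untie %files;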

    Do you need to build the complete data structure first or could you get away with File::Find callbacks to process files as they're encountered?

Re: Hash Entry Deallocation
by Elian (Parson) on Jun 30, 2003 at 17:51 UTC
    Well, for the specific question, use delete, which deletes elements from a hash. There's a limit to how much memory you'll get back, since parts of the hash structure won't be freed, but the individual elements will be. It looks like:
    delete $foo{somekey};
    For this specific problem, have you considered using File::Find instead? It invokes a callback routine for each file, so it doesn't have to pre-load all the filenames and thus doesn't use nearly as much memory.
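
    A short sketch of that approach, assuming the same "names containing a dot" match as the original Recurse() call and a made-up do_work() standing in for the real per-file processing:

        use File::Find;

        find(sub {
            return unless -f && /\./;       # plain files with a dot in the name
            do_work($File::Find::name);     # full path of the current file
        }, $drive);

    Inside the callback $_ holds the bare filename and $File::Find::name the full path, so nothing ever has to be accumulated in memory.
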
Re: Hash Entry Deallocation
by gjb (Vicar) on Jun 30, 2003 at 17:48 UTC

    Do you really have to store all directory entries or could you just do "Do work" on each entry without knowing anything about other entries?

    If that's the case, you could consider using File::Find, which traverses the file system and lets you run some code on each matching entry (or on every entry, if you want) it finds.

    Just my 2 cents, -gjb-

Re: Hash Entry Deallocation
by halley (Prior) on Jun 30, 2003 at 17:54 UTC
    In terms of the size of a structure in Perl, there are a few separate considerations.
    • the hash bins themselves: the bucket array grows automatically to try to keep the collision rate low, but I don't think it ever shrinks, even when pairs are deleted. On each growth, all the keys are rehashed internally, much like pouring all the water from a small bucket into a larger one. You might try to reverse the process by occasionally pouring a half-empty bucket back into a smaller one: it might be as simple as doing %hash = (%hash) if (eval scalar %hash) < 0.1; but I haven't tested this. (The eval scalar %hash test is illustrated in the small sketch after this list.)
    • the elements themselves. Is your hash storing references to large structures? Are these really reaped when you remove them from the hash, or are you forgetting some other variable that's also keeping these references alive?
    • Perl never hands memory back to the operating system. It's not Perl's fault, it's the way memory works in most modern operating systems. When Perl reaps an object, the memory is free for Perl to use again, but not free for Excel or XFree to use until Perl exits.
    • A 32-bit machine has limits on how much memory one process can address: 4GB is the theoretical ceiling, but it's usually cut to 2GB-3.5GB depending on kernel or operating system overhead and design. If you want to go beyond that, you'll have to move to a 64-bit OS, or get creative with persistent disk-backed storage and auto-restoration of active entities.
    • Most large-data algorithms can be rewritten as stream-data algorithms instead. Do you need to keep all those entries around for so long? Sometimes you do, but oftentimes you don't. Rethink the core goals of your algorithm and see if there's another way.
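
    As a small illustration of the eval scalar %hash test above: on Perls of this vintage, a hash in scalar context reports used/allocated buckets as a string like "3/128", so eval'ing it yields the fill ratio (newer Perls report the key count instead, so the trick no longer applies there):

        my %h = map { $_ => 1 } 1 .. 1000;
        delete $h{$_} for 1 .. 990;
        print scalar(%h), "\n";                       # something like 10/1024
        printf "fill ratio: %.4f\n", eval scalar %h;  # something like 0.0098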

    --
    [ e d @ h a l l e y . c c ]

      Perl never hands memory back to the operating system. It's not Perl's fault, it's the way memory works in most modern operating systems. When Perl reaps an object, the memory is free for Perl to use again, but not free for Excel or XFree to use until Perl exits.
      Some day we'll manage to stomp this particular meme out, as it's based on a misfeature originally wired into Unix, one that has been fixed for quite a number of years in most versions. Modern operating systems do give memory back to the system, and older operating systems (Unix, remember, is 33 years old -- hardly modern) generally do as well in their recent versions.
        I think this is quibbling over semantics, the point at which one starts to say things like "memory is not RAM." It really depends on how you look at it.

        This is my understanding, which may be out of date; I'm always up for a refresher course in modern operating system design.

        From the point of view of the application and its malloc(), the memory space is a contiguous resource: once available, it is always available. From the kernel's point of view, RAM pages that have not been accessed lately may be written out to some persistent store and repurposed for other processes under a new virtual address. However, malloc() may later revisit that once-free()'d range, and then the kernel has to arrange some RAM to back it. A process that frees a chunk of memory in the middle of its address range does not shrink in the process-list accounting; instead, the kernel over-books the RAM across the processes needing memory, and thrashes when the over-booking becomes significant enough that processes contend for more RAM than is physically available.

        --
        [ e d @ h a l l e y . c c ]