Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All:
I have already come up with two solutions to my problem, so this is more a question of which one to implement based on how Perl works internally. A verbose description follows:

I am processing a very transient directory in an infinite loop. Basically, I read the first 64K of a file to look for specific information. If the information is present, the file gets moved out of the directory. If not, it is eventually moved by another process that I am in a race condition with. In between each cycle, I sleep for a period of time so as not to chew up too much CPU. About once a minute, I update my list of processing criteria, since it changes over time. What I would like to do is cache the file names I have already processed so that I do not process the same file twice. The file names eventually get re-used, but if I invalidate my cache when I update my list, I can be assured that there are no issues. My idea is as follows:

  • Move on to the next file if it is in my cache
  • If not, check to see if it meets my criteria
  • If yes, move it out of the directory and do nothing with the cache
  • If no, add the file to my cache and move on to the next file
  • Once a minute, clear my cache
  • I could either push the filename onto an array, or create a hash key:
    next if (exists $cache{$_});   versus   next if (grep /\b$_\b/, @cache);
    and later on: $cache{$_}++;   versus   push @cache, $_;
    (a rough sketch of the hash-based loop follows this list)
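
    To make the idea concrete, here is a minimal sketch of the loop using the hash-based cache. The directory name, meets_criteria(), move_file(), and update_criteria() are placeholders I made up for whatever the real code does, and the intervals are just illustrative:

        use strict;
        use warnings;

        my $dir        = '/some/transient/dir';   # placeholder path
        my %cache;                                 # file names already examined
        my $last_clear = time;

        # Placeholder stubs - the real code would do the actual work here.
        sub meets_criteria  { return 0 }   # hypothetical: read first 64K, test it
        sub move_file       { return }     # hypothetical: move the file out
        sub update_criteria { return }     # hypothetical: refresh the rules

        while (1) {
            opendir my $dh, $dir or die "Can't open $dir: $!";
            for my $file ( readdir $dh ) {
                next if $file =~ /^\.\.?$/;     # skip . and ..
                next if exists $cache{$file};   # already looked at this name

                if ( meets_criteria("$dir/$file") ) {
                    move_file("$dir/$file");
                    # nothing added to the cache - the name is gone from the dir
                }
                else {
                    $cache{$file} = undef;      # remember it so we skip it next pass
                }
            }
            closedir $dh;

            if ( time - $last_clear >= 60 ) {   # roughly once a minute
                update_criteria();              # also the moment the cache goes stale
                %cache = ();                    # invalidate the cache
                $last_clear = time;
            }
            sleep 5;                            # don't chew up the CPU
        }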

    What are the dynamics of each approach? Internally, does clearing with

    %cache = (); or @cache = ();
    and then re-building the cache have any impact as far as memory allocation/speed? Is there a point at which having more files in the cache gives one approach a speed advantage over the other? Is there a rule of thumb like "if under 100 items, use A"?

    Basically, I am asking how each approach works internally so that I can decide which method to implement based on my dynamic environment. I can't really benchmark without live data. I can siphon off live data for file variation, but I can't replay it at the same speed as it happens in production, so I never know how deep the directory will be.

    Cheers - L~R

    Replies are listed 'Best First'.
    Re: Internals question - "exists" for hash keys versus grep'ing array
    by dragonchild (Archbishop) on Mar 19, 2003 at 19:04 UTC
      Caches are almost exclusively the territory of hashes. grep is fast, but it scans the whole array, so each lookup is O(n); even a binary search over a sorted array is only O(log n). A hash lookup is (essentially) O(1), which is about as fast as is possible (that we know of).
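
      If you want to see the size of the gap for yourself, a quick synthetic benchmark along these lines should show it (the 1_000-name cache and the target name are just assumptions; it uses eq in the grep so both cases do a plain lookup):

          use strict;
          use warnings;
          use Benchmark qw(cmpthese);

          # A synthetic cache of 1_000 made-up file names.
          my @names = map { "file_$_.dat" } 1 .. 1_000;
          my @cache = @names;
          my %cache = map { $_ => undef } @names;

          my $target = 'file_999.dat';   # near the end - worst case for grep

          cmpthese(-1, {
              grep_array  => sub { my $found = grep { $_ eq $target } @cache },
              exists_hash => sub { my $found = exists $cache{$target} },
          });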

      Now, the classic tradeoff for speed is memory. You have to remember more with hashes, because every value has a key associated with it. Arrays, on the other hand, just store the value.

      From what you've described, speed is much more important to you than memory, especially since you reset your cache every so often. (This doesn't free the memory to the OS, but it does free it back to the Perl interpreter.)

      Now, there's an even better thing with hashes with regards to memory. Perl pre-allocates a certain amount of space whenever you use any variable, be it a scalar, array, or hash. (This, as you might guess, is for speed considerations.) With arrays, the number I've most often heard is that around 50 scalars-worth are pre-allocated. Hashes pre-allocate 16 scalars-worth (or less) - 2 scalars per bucket. Now, hashes grow pretty quickly (for example, 65 scalars would need space for 100 in a list, but 128 in a hash), but if you're not going beyond 32, you're still in the same ballpark for space.
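
      If you want to put rough numbers on the memory side, Devel::Size from CPAN (assuming it is installed) can report the total footprint of each structure. A small sketch with 1_000 made-up file names:

          use strict;
          use warnings;
          use Devel::Size qw(total_size);

          my @names = map { "file_$_.dat" } 1 .. 1_000;   # made-up cache contents

          my @cache = @names;
          my %cache = map { $_ => undef } @names;

          printf "array cache: %d bytes\n", total_size(\@cache);
          printf "hash  cache: %d bytes\n", total_size(\%cache);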

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        Thanks dragonchild!
        I assumed the hash would be fastest for lookups but might be slower to build; I wanted to see the tradeoff.
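
        For what it's worth, the build side can be measured with the same kind of synthetic benchmark (the 1_000 made-up names are again just an assumption):

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            my @names = map { "file_$_.dat" } 1 .. 1_000;   # made-up file names

            cmpthese(-1, {
                build_array => sub { my @cache; push @cache, $_ for @names },
                build_hash  => sub { my %cache; $cache{$_} = undef for @names },
            });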

        Cheers - L~R

    Re: Internals question - "exists" for hash keys versus grep'ing array
    by diotalevi (Canon) on Mar 19, 2003 at 19:32 UTC

      $cache{$_}++;
      That is poor form, especially for your application. You aren't counting how often the element is present - you just care about whether the key exists. $cache{$_} = undef is better: the key still exists, but no real value is stored and you skip the whole post-increment machinery. So it's lighter weight on both memory and performance.
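
      In other words, something along these lines (the file names are only placeholders); the hash-slice form is a common way to mark several names at once:

          use strict;
          use warnings;

          my %cache;

          # Mark a single name as seen without storing a real value.
          $cache{'some_file.dat'} = undef;

          # Mark several names at once with a hash slice.
          my @seen = ('a.dat', 'b.dat', 'c.dat');
          @cache{@seen} = ();

          # The membership test only cares about the key, not the value.
          print "skip\n" if exists $cache{'some_file.dat'};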