
All:
I have already come up with two solutions to my problem, so this is more a question of which one to implement based on how Perl works internally. You can readmore for a verbose description:

I am processing a very transient directory in an infinite loop. Basically, I read the first 64K of each file to look for specific information. If the information is present, the file gets moved out of the directory. If not, it is eventually moved by another process that I am in a race condition with. Between cycles, I sleep for a period of time so as not to chew up too much CPU. About once a minute, I update my list of criteria for processing, as it changes over time. What I would like to do is cache the file names I have already processed, so that I do not process the same file twice. The file names eventually get re-used, but if I invalidate my cache whenever I update my list, I can be assured that there are no issues. My idea is as follows:

Move on to the next file if it is in my cache

If not, check to see if it meets my criteria

If yes, move it off the directory and do nothing with cache

If no, add the file to my cache and move on to the next file

Once a minute, clear my cache
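The steps above can be sketched roughly as follows. This is only a minimal sketch: it picks the hash variant purely for illustration, and `process_pass`, `reset_cache`, `$meets_criteria` and `$move_file` are hypothetical names standing in for the real 64K scan and the real move.

```perl
use strict;
use warnings;

# Names already checked and rejected since the last criteria refresh.
my %cache;

# One pass over a directory listing.
# $meets_criteria and $move_file are hypothetical callbacks (assumptions)
# standing in for the 64K scan and the move out of the directory.
sub process_pass {
    my ($files, $meets_criteria, $move_file) = @_;
    for my $file (@$files) {
        next if exists $cache{$file};    # already rejected this cycle
        if ($meets_criteria->($file)) {
            $move_file->($file);         # moved away, nothing to cache
        }
        else {
            $cache{$file} = 1;           # remember the miss
        }
    }
    return;
}

# Called about once a minute, when the criteria list is reloaded.
sub reset_cache {
    %cache = ();
    return;
}
```

The main loop would then call `process_pass` with the current directory listing, sleep, and call `reset_cache` whenever the criteria are refreshed.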

I could either push the filename to an array, or create a hash key

next if (exists $cache{$_});
or
next if (grep { $_ eq $file } @cache);   # $file is a copy of the current name; inside grep, $_ is each cache element

and later on .....

push @cache , $_;
or
$cache{$_}++;

What are the dynamics of each approach? Internally does

%cache = ();
or
@cache = ();

and then re-creation have any impact on memory allocation or speed? Is there a point at which having more files in the cache gives one approach a speed advantage over the other? Is there a rule of thumb, like "under 100 items, use A"?

Basically, I am asking how each approach works internally so that I can decide which one to implement for my dynamic environment. I can't really benchmark without live data. I can siphon off live data for file variation, but I can't replay it at the same speed as it happens in production, so I never know how deep the directory will be.
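Short of live data, a rough comparison with synthetic names at a few guessed cache depths might at least show the trend. The sketch below uses the core Benchmark module; the file names, the sizes and the `compare_lookups` helper are all assumptions made up for illustration, and the numbers it prints say nothing about production timing.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Time one lookup per iteration against a cache of $size synthetic names.
# The names and sizes are invented; they do not reflect production data.
sub compare_lookups {
    my ($size, $iterations) = @_;
    my @cache = map { "file$_.dat" } 1 .. $size;
    my %cache = map { $_ => 1 } @cache;
    my $probe = "file$size.dat";    # a name that is present in both caches

    # Style 'none' suppresses timethese's own report; the results
    # hash reference is handed to cmpthese for the comparison table.
    return timethese($iterations, {
        hash_exists => sub { my $hit = exists $cache{$probe} },
        array_grep  => sub { my $hit = grep { $_ eq $probe } @cache },
    }, 'none');
}

for my $size (10, 100, 1_000) {
    print "cache size $size\n";
    cmpthese(compare_lookups($size, 10_000));
}
```

The array variant has to visit every element on each lookup, since grep does not stop at the first match, so its cost should grow with the cache depth while the hash lookup stays roughly flat.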

Cheers - L~R

In reply to Internals question - "exists" for hash keys versus grep'ing array by Limbic~Region
