Just a general comment: since you're trying to create a list that maximizes the ease and efficiency of manual review, it would make more sense to rank-sort the list than to categorize it -- e.g. the files most likely to be temporary (those whose names were not generated by humans) should dominate the top of the list. N-gram statistics would be a natural basis for ranking file names by the likelihood that they are temp files.

To build a suitable "background" ngram model, it might be good to supplement (or replace) your dictionary with a "corpus" of non-temp-file names. For example, if you take all the file names that include punctuation (e.g. [-_+=. :]), split on punctuation, and count trigrams within chunks of 3 or more alphanumerics, you should have a more "realistic" set of probabilities for trigrams that make up non-temp file names.
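Here is a minimal sketch of that counting step, in Python rather than Perl and meant only as an illustration. The function name `build_trigram_counts`, the lowercasing, and the decision to skip chunks that still contain non-alphanumeric characters are my assumptions, not anything from the original thread.

```python
import re
from collections import Counter

# The punctuation set suggested above: - _ + = . space :
PUNCT = re.compile(r"[-_+=. :]+")

def build_trigram_counts(names):
    """Count character trigrams in the chunks of punctuated file names."""
    counts = Counter()
    for name in names:
        # Only names that contain punctuation feed the background model.
        if not PUNCT.search(name):
            continue
        # Lowercasing is an assumption, to keep the trigram table small.
        for chunk in PUNCT.split(name.lower()):
            # Keep chunks of 3 or more alphanumerics; skip anything else.
            if len(chunk) < 3 or not chunk.isalnum():
                continue
            for i in range(len(chunk) - 2):
                counts[chunk[i:i + 3]] += 1
    return counts

if __name__ == "__main__":
    sample = ["report_final.txt", "notes-2009.bak", "x7f3kq9z"]
    for trigram, n in build_trigram_counts(sample).most_common(5):
        print(trigram, n)
```

With a real directory listing as input, the resulting counts can be turned into relative frequencies to serve as the "background" probabilities.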

Then it's just a matter of assigning a score to each file name in a given list (update: i.e. a list of file names that have no punctuation), such that names using a lot of improbable trigrams score very low, and those comprising mostly plausible (likely, frequent) trigrams score very high. Sort the list by score (lowest first), and the files that come out on top are most likely to be the easiest for human judges to dismiss as obvious temp files.
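One possible way to do the scoring, sketched in Python under some assumptions of mine: you already have a table of trigram counts (a plain dict called `counts` here), the score is the average log-probability per trigram with add-one smoothing, and `VOCAB_SIZE` assumes lowercase letters and digits only. Names too short to contain a trigram get 0.0, which puts them at the bottom of the list. None of these choices come from the thread; any monotone scoring of trigram likelihood would fit the description.

```python
import math

# Assumed trigram vocabulary: 36 symbols (a-z, 0-9) in 3 positions.
VOCAB_SIZE = 36 ** 3

def trigram_score(name, counts, total):
    """Average smoothed log-probability of the name's trigrams (<= 0)."""
    s = name.lower()
    grams = [s[i:i + 3] for i in range(len(s) - 2)]
    if not grams:
        return 0.0  # assumption: too short to judge, so rank it last
    logp = sum(
        math.log((counts.get(g, 0) + 1) / (total + VOCAB_SIZE))
        for g in grams
    )
    return logp / len(grams)

def rank_names(names, counts):
    """Sort names lowest score first, i.e. most temp-like on top."""
    total = sum(counts.values())
    return sorted(names, key=lambda n: trigram_score(n, counts, total))

if __name__ == "__main__":
    # Toy counts standing in for a real background model.
    counts = {"rep": 5, "epo": 5, "por": 5, "ort": 5}
    for name in rank_names(["report", "xq7zk2"], counts):
        print(name)
```

Averaging over the number of trigrams keeps long names from scoring low merely because they are long; summing instead would be an equally defensible choice.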

And then it's just a matter of the judges deciding how far down the list they need to go in order to "finish" (because they've already found enough temp files to free up adequate space, or because they reach a point where there are too few temp files left to bother with).

Of course, I'd be tempted to include file size in the sorting somehow -- deleting bigger temp files first would be a big help. But I don't know how well that would apply to your case.
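Purely as an illustration of one way size could enter the sort: the sketch below subtracts a weighted log of the file size from the name score, so that among equally suspicious names the bigger file comes first. The formula, the `weight` parameter, and the `(name, score, size_bytes)` tuple layout are all hypothetical; I haven't tried this on real data.

```python
import math

def size_adjusted_key(score, size_bytes, weight=0.5):
    """Lower key = earlier in the review list.

    Hypothetical combination: a low name score (temp-like) and a
    large size both push a file toward the top.
    """
    return score - weight * math.log10(size_bytes + 1)

def rank_with_size(entries, weight=0.5):
    """entries: iterable of (name, score, size_bytes) tuples."""
    return sorted(
        entries,
        key=lambda e: size_adjusted_key(e[1], e[2], weight),
    )

if __name__ == "__main__":
    entries = [
        ("a", -9.0, 10),
        ("b", -9.0, 10_000_000),
        ("c", -2.0, 100),
    ]
    for name, score, size in rank_with_size(entries):
        print(name, score, size)
```

Setting `weight` to 0 falls back to sorting by name score alone, so the judges can tune how much size matters.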

In reply to Re: Finding Temporary Files by graff
in thread Finding Temporary Files by eff_i_g
