in reply to Finding Temporary Files

Just a general comment: since you're trying to create a list that maximizes the ease and efficiency of manual review, it would make more sense to rank-sort the list than to categorize it -- e.g. the files most likely to be temporary (those whose names were not generated by humans) should dominate the top of the list. N-gram statistics would be a natural basis for ranking file names by the likelihood that they are temp files.

To build a suitable "background" n-gram model, it might be good to supplement (or replace) your dictionary with a corpus of non-temp-file names. For example, if you take all the file names that include punctuation (e.g. [-_+=. :]), split them on that punctuation, and count trigrams within chunks of 3 or more alphanumerics, you should get a more realistic set of probabilities for the trigrams that make up non-temp file names.
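A minimal sketch of that counting step, in Python -- the function names (build_model, trigrams) are just illustrative, and for simplicity it treats every non-alphanumeric character as a separator:

    import re
    from collections import Counter

    def trigrams(chunk):
        # All overlapping 3-character substrings of one alphanumeric chunk.
        return (chunk[i:i + 3] for i in range(len(chunk) - 2))

    def build_model(filenames):
        # Background corpus: only names that contain punctuation, split on
        # non-alphanumerics, counting trigrams in chunks of 3+ characters.
        counts = Counter()
        for name in filenames:
            if not re.search(r'[^a-zA-Z0-9]', name):
                continue  # no punctuation: not part of the background corpus
            for chunk in re.split(r'[^a-zA-Z0-9]+', name.lower()):
                if len(chunk) >= 3:
                    counts.update(trigrams(chunk))
        return counts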

Then it's just a matter of assigning a score to each file name in the candidate list (update: i.e. the file names that contain no punctuation), such that names built from improbable trigrams score very low and names consisting mostly of plausible (likely, frequent) trigrams score very high. Sort the list by score, lowest first, and the files that come out on top are the ones human judges should find easiest to dismiss as obvious temp files.
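A correspondingly minimal scorer, assuming the build_model sketch above; it uses the mean log-probability per trigram, with add-one smoothing so a single unseen trigram doesn't drive the score to negative infinity:

    import math

    def name_score(name, counts):
        # Mean log-probability of the name's trigrams under the background
        # model. Lower = more improbable = more temp-like.
        tris = [name.lower()[i:i + 3] for i in range(len(name) - 2)]
        if not tris:
            return 0.0  # too short to judge; sorts to the bottom of the list
        total = sum(counts.values())
        vocab = len(counts) or 1  # add-one smoothing denominator
        logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tris)
        return logp / len(tris)  # normalize so long names aren't over-penalized

    model = build_model(["quarterly-report.txt", "notes_2019.md"])
    candidates = ["xk9q2zvt", "report"]  # names with no punctuation
    candidates.sort(key=lambda n: name_score(n, model))  # most temp-like first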

The judges then just decide how far down the list they need to go in order to "finish" -- either because they've already found enough temp files to free up adequate space, or because they've reached a point where too few temp files remain to bother with.

Of course, I'd be tempted to include file size in the sorting somehow -- deleting bigger temp files first would be a big help. But I don't know how well that would apply to your case.
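One crude way to fold size in, assuming the sketches above: bucket the n-gram score coarsely, then break ties by size, largest first, so the big gibberish-named files surface at the very top of the review list. The 0.1 bucket width here is an arbitrary choice, not a tuned constant:

    import os

    def review_order(paths, model):
        # Primary key: score rounded to one decimal (most temp-like first);
        # secondary key: file size, descending.
        def key(path):
            s = name_score(os.path.basename(path), model)
            return (round(s, 1), -os.path.getsize(path))
        return sorted(paths, key=key)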