in reply to Finding Temporary Files

eff_i_g,
Here are some things you might want to consider. First, when I create a temporary file it is almost always called foo (foo.pl, foo.csv, etc.). You might want to add names like foo, bar, blah, and asdf to your list of candidates. Also, I often create a directory called backup or archive that I stick files in. You should consider that normally named files might be temporary solely because of the directory they are in. I have also adopted a convention of appending a number or a date to a file name if I want to keep a few versions around (some_utility.3, some_utility.pl.3, or some_utility.2010-12-31). You may also want to consider using a checksum to determine whether there are any truly duplicate files regardless of name.
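
To make the suggestions above concrete, here is a rough sketch in Python (the idea ports directly to Perl). It is not taken from anyone's actual script: the names `PLACEHOLDER_STEMS`, `HOLDING_DIRS`, `is_suspect`, and `find_duplicates` are hypothetical, the lists only mirror the conventions I described, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
import os
import re
from collections import defaultdict

# Hypothetical lists based on the conventions described above; extend as needed.
PLACEHOLDER_STEMS = {"foo", "bar", "blah", "asdf"}
HOLDING_DIRS = {"backup", "archive"}
# Trailing version suffix such as ".3" or ".2010-12-31".
VERSION_SUFFIX = re.compile(r"\.(\d+|\d{4}-\d{2}-\d{2})$")


def is_suspect(path):
    """Return True if the path matches one of the naming conventions."""
    parts = path.replace("\\", "/").split("/")
    name = parts[-1]
    # Anything under a backup/archive directory is suspect by location alone.
    if any(part.lower() in HOLDING_DIRS for part in parts[:-1]):
        return True
    # Numbered or dated copies kept as old versions.
    if VERSION_SUFFIX.search(name):
        return True
    # Placeholder names such as foo.pl or foo.csv.
    stem = name.split(".", 1)[0].lower()
    return stem in PLACEHOLDER_STEMS


def find_duplicates(root):
    """Group files under root whose contents have the same SHA-256 digest."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    # Only digests shared by more than one file are duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

On a large share it would probably pay to group files by size first and only checksum the files whose sizes collide, since hashing every byte of a terabyte is slow.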

As for identifying the truly temporary files: all 3 of your examples are exactly 10 characters long. I am not sure if that is a coincidence, but if it is not, it should be efficient to write a more robust noise detector that is applied only to file names that are 10 characters long and do not contain a period.
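
A minimal sketch of that idea in Python, under stated assumptions: the cheap length-and-period test runs first, and only the survivors get a noise check. The noise heuristic here (counting switches between lowercase, uppercase, and digits) is my own stand-in, not a proven detector, and the names `is_candidate`, `looks_like_noise`, and `min_switches` are hypothetical. It will miss random names that happen to be all lowercase.

```python
def char_class(ch):
    """Classify a character as digit, upper, lower, or other."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    return "other"


def is_candidate(name, length=10):
    """Cheap prefilter: exactly `length` characters and no period."""
    return len(name) == length and "." not in name


def looks_like_noise(name, min_switches=3):
    """Flag candidates whose characters switch class at least min_switches times."""
    if not is_candidate(name):
        return False
    classes = [char_class(ch) for ch in name]
    # Count adjacent pairs whose character classes differ.
    switches = sum(1 for a, b in zip(classes, classes[1:]) if a != b)
    return switches >= min_switches
```

With made-up names, `looks_like_noise("aB3xY9kLmQ")` is true, while `looks_like_noise("Makefile10")` and `looks_like_noise("backup2010")` are false because each has fewer than three class switches.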

Cheers - L~R

Re^2: Finding Temporary Files
by eff_i_g (Curate) on Jan 14, 2011 at 22:53 UTC
    L~R,

    Thanks for your input. I've attached the latest and greatest which includes:

    1. Logging and reporting
    2. Share selection with capacities
    3. Updated RE's
    4. A union of Solaris' and WordNet's dictionaries
    5. A list of suspects to exclude from the dictionary
    6. A find that includes directories

    Some aspects are customized to our environment (shares, RE's, and a few tidbits), but overall I'm pleased with what I have so far and I think it's easy enough to expand. It takes under an hour to scour ~1TB of shares and comes back with ~2,500 offenders totalling 25G. I just finished updating the script so I need to review the code for bugs and tweaks, but I've included it below nonetheless.

    For now I've forgone checksums for similarly named files because this should not be an issue for us. Also, the 10-character lengths were a coincidence; I'm looking for variable lengths.

    For me a run looks like this:
    The report like this:
    And, finally, the code: