Monks,

I'm working on Unix and when our data shares start filling up (80%-ish) I run a program that creates a report of suspect "temporary" files—files matching the pattern /\Acore\z|copy|te?mp|bak|\b(?:old|test)|([a-z_])\1\1/i or files older than 2 years—and e-mail it to the department to review.

One area that this script misses entirely is temporary files that were modified within the last 2 years and have a truly temporary name (like yacb4yGI6p, YQGsCV6Rbx, and SL8qEfnFDQ).

My approach to finding these (as seen in the testing script below) is to:

1. Create a regex of common letter trigrams taken from the dictionary
2. Hash dictionary words with a length >= 4
3. Look for all files that:
- Have a lowercase letter (that's not in the extension)
- Have an uppercase letter (that's not in the extension)
- Have a digit
- Only contain letters and digits and may or may not have an extension
- Do not contain a word from the dictionary with a length >= 4
- Do not match the list of trigrams
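
To make the steps above concrete, here is a minimal sketch of the same filter run against the three sample names. It is only an illustration under some assumptions: the word list, the `looks_temporary` name, and the extra file names (`Report2009.txt`, `backupV2`, `Proj4Rep`) are made up, and it keeps every trigram from its tiny word list, whereas the full script reads the system dictionary and drops trigrams below a 1% frequency threshold.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the system dictionary.
my @words = qw(report backup invoice summary project);

# Hash words of length >= 4 and collect every letter trigram seen in them.
my (%dict, %trigram);
for my $word (@words) {
    $dict{$word} = 1 if length($word) >= 4;
    $trigram{ substr($word, $_, 3) }++ for 0 .. length($word) - 3;
}

# Returns 1 when a file name looks like a generated temporary name, else 0.
sub looks_temporary {
    my ($name) = @_;

    # Lowercase and uppercase letters must appear outside the extension.
    (my $base = $name) =~ s/\.[^.]*\z//;
    return 0 unless $base =~ /[a-z]/ && $base =~ /[A-Z]/ && $name =~ /\d/;

    # Letters and digits only, with an optional short extension.
    return 0 unless $name =~ /\A[a-zA-Z\d]+(?:\.[a-zA-Z]{1,4})?\z/;

    # Candidate words: capitalized or lowercase runs, or all-caps runs.
    for my $chunk ($name =~ /([A-Za-z][a-z]+|[A-Z]+)/g) {
        return 0 if exists $dict{ lc $chunk };
    }

    # Any known trigram in the lowercased name marks it as word-like.
    my $lower = lc $name;
    for my $i (0 .. length($lower) - 3) {
        return 0 if exists $trigram{ substr($lower, $i, 3) };
    }
    return 1;
}

for my $name (qw(yacb4yGI6p YQGsCV6Rbx SL8qEfnFDQ Report2009.txt backupV2 Proj4Rep)) {
    printf "%-16s %s\n", $name, looks_temporary($name) ? 'suspect' : 'ok';
}
```

With this word list the three sample names come out as suspect. `Report2009.txt` and `backupV2` are cleared by the dictionary check, and `Proj4Rep` is cleared only by the trigram check (it contains "pro"), which is the case the trigrams are there to catch.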

When I run this on ~1TB of data the results look decent. Some valid files show up, but the bulk are temporaries.

Does anyone have suggestions for improving this process or a new approach to offer? I'm certainly no linguist.

Also, keep in mind:

- This process only generates a report for humans to review; it does not take action and whack files. I believe automating that would be impossible due to the idiosyncrasies of English, the sloppiness of some typists, and the occasional job that has legitimate yet awkward naming schemes or IDs.
- Files named in other languages are very rare for us.

Many thanks.

Code:

use File::Find::Rule;
use List::MoreUtils qw(all);
use Number::Format qw(format_number);
use Regexp::Assemble;

my %ngram;
my %dict;
my $total;

### Which dictionary? 'Tis set up for testing at home and work.
my $uname = `uname -a`;
my $dict = $uname =~ /debian/i ? '/usr/share/dict/american-english' :
        $uname =~ /SunOS/i ? '/usr/share/lib/dict/words' :
        undef ;

### Gather ngrams.
die "No dictionary path known for this system\n" unless defined $dict;
open my $DICT, '<', $dict or die "Cannot open '$dict': $!";
while (<$DICT>) {
    chomp;
    ### Only allow words that begin with a lowercase letter,
    ### contain only letters (no hyphens, quotes, etc.),
    ### and have 3 or more letters.
    next unless m/\A[a-z][A-Za-z]+\z/ && length >= 3;
    print "$_\n";
    ### Gather letter trios (ngrams, or, more specifically, trigrams).
    my $str = $_;
    my @ngrams = map {
        substr($str, $_, 3);
    } 0 .. (length $_) - 3;
    ### Tally.
    ++$ngram{$_} for @ngrams;
    ++$total;
    ### Only add 4+ lengths to the dictionary--many temps were
    ### matching lengths of 3.
    ++$dict{$_} if length >= 4;
}
print "\n";
print 'Total words: ', format_number($total), "\n";

### Show the results sorted by occurrence and remove those less than 1%.
print "All:\n";
for my $ngram (sort {$ngram{$b} <=> $ngram{$a}} keys %ngram) {
    my $percentage = format_number(($ngram{$ngram} / $total) * 100, 1, 1);
    printf "%3s: %4s (%4s%%)\n", $ngram, format_number($ngram{$ngram}), $percentage;
    delete $ngram{$ngram} if $percentage < 1;
}
print "\n";

print "Keepers:\n";
for my $ngram (sort {$ngram{$b} <=> $ngram{$a}} keys %ngram) {
    my $percentage = format_number(($ngram{$ngram} / $total) * 100, 1, 1);
    printf "%3s: %4s (%4s%%)\n", $ngram, format_number($ngram{$ngram}), $percentage;
}
print "\n";

### Build an RE based on the ngrams.
my $ra = Regexp::Assemble->new;
$ra->add($_) for keys %ngram;
print $ra->re, "\n";

### Files must match these to be considered temporary.
my @REs = (
    ### Lower/upper case letters not in the extension.
    qr/\A[^.]*[a-z]/,
    qr/\A[^.]*[A-Z]/,
    ### Digit.
    qr/\d/,
    ### Name only contains upper/lower case letters or digits; ext. optional.
    qr/\A[a-zA-Z\d]+(?:\.[a-zA-Z]{1,4})?\z/,
);

File::Find::Rule->file
    ->exec(
        sub {
            my $file = $_;
            ### Test for REs, words, then ngrams.
            return unless all { $file =~ $_ } @REs;
            for ($file =~ /([A-Za-z][a-z]+|[A-Z]+)/g) {
                if (exists $dict{lc $_}) {
                    print "\tSkipping '$file' due to presence of '$_'\n";
                    return;
                }
            }
            ### Parenthesize lc's argument: =~ binds tighter than lc.
            return if lc($file) =~ $ra->re;
            print "$file\n";
        }
    )
    ->in(qw(/data /tmp));
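
One caveat for a test like the final trigram check: named unary operators such as `lc` bind less tightly than `=~`, so the parentheses around the argument matter. The sketch below shows the difference. The regex and file name are hypothetical stand-ins, not values from the real run.

```perl
use strict;
use warnings;

my $re   = qr/ing/;       # hypothetical stand-in for the assembled trigram regex
my $name = 'TestING1';    # hypothetical file name

# Parsed as lc($name =~ $re): the match runs against the mixed-case name.
my $without_parens = (lc $name =~ $re) ? 1 : 0;

# Lowercase first, then match.
my $with_parens = (lc($name) =~ $re) ? 1 : 0;

print "without parens: $without_parens, with parens: $with_parens\n";
# prints "without parens: 0, with parens: 1"
```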

In reply to Finding Temporary Files by eff_i_g