Finding Temporary Files

by eff_i_g (Curate)
on Jan 07, 2011 at 21:13 UTC [id://881164]

eff_i_g has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm working on Unix and when our data shares start filling up (80%-ish) I run a program that creates a report of suspect "temporary" files—files matching the pattern /\Acore\z|copy|te?mp|bak|\b(?:old|test)|([a-z_])\1\1/i or files older than 2 years—and e-mail it to the department to review.

One area that this script misses entirely is temporary files that were modified within the last 2 years and have genuinely random, temporary-looking names (like yacb4yGI6p, YQGsCV6Rbx, and SL8qEfnFDQ).

My approach to finding these (as seen in the testing script below) is to:

  1. Create a regex of letter trigrams
  2. Hash words with a length >= 4
  3. Look for all files that:
    • Have a lowercase letter (that's not in the extension)
    • Have an uppercase letter (that's not in the extension)
    • Have a digit
    • Only contain letters and digits and may or may not have an extension
    • Do not contain a word from the dictionary with a length >= 4
    • Do not match the list of trigrams

When I run this on ~1TB of data, the results look decent. Some valid files show up, but the bulk are temporaries.

Does anyone have suggestions for improving this process or a new approach to offer? I'm certainly no linguist.

Also, keep in mind:

  1. This process is only generating a report for humans to review, not taking action and whacking files. I believe that would be impossible due to the idiosyncrasies of English, the sloppiness of some typists, and the occasional job that has legit yet awkward naming schemes or IDs.
  2. Files named in other languages are very rare for us.

Many thanks.

Code:
use strict;
use warnings;
use File::Find::Rule;
use List::MoreUtils qw(all);
use Number::Format qw(format_number);
use Regexp::Assemble;

my %ngram;
my %dict;
my $total;

### Which dictionary? 'Tis set up for testing at home and work.
my $uname = `uname -a`;
my $dict = $uname =~ /debian/i ? '/usr/share/dict/american-english'
         : $uname =~ /SunOS/i  ? '/usr/share/lib/dict/words'
         : undef
         ;

### Gather ngrams.
open my $DICT, '<', $dict or die $!;
while (<$DICT>) {
    chomp;
    ### Only allow words that begin with a lowercase letter,
    ### contain only letters (no hyphens, quotes, etc.),
    ### and have 3 or more letters.
    next unless m/\A[a-z][A-Za-z]+\z/ && length >= 3;
    print "$_\n";
    ### Gather letter trios (ngrams, or, more specifically, trigrams).
    my $str = $_;
    my @ngrams = map { substr($str, $_, 3) } 0 .. (length $_) - 3;
    ### Tally.
    ++$ngram{$_} for @ngrams;
    ++$total;
    ### Only add 4+ lengths to the dictionary--many temps were matching lengths of 3.
    ++$dict{$_} if length >= 4;
}
print "\n";
print 'Total words: ', format_number($total), "\n";

### Show the results sorted by occurrence and remove those less than 1%.
print "All:\n";
for my $ngram (sort { $ngram{$b} <=> $ngram{$a} } keys %ngram) {
    my $percentage = format_number(($ngram{$ngram} / $total) * 100, 1, 1);
    printf "%3s: %4s (%4s%%)\n", $ngram, format_number($ngram{$ngram}), $percentage;
    delete $ngram{$ngram} if $percentage < 1;
}
print "\n";
print "Keepers:\n";
for my $ngram (sort { $ngram{$b} <=> $ngram{$a} } keys %ngram) {
    my $percentage = format_number(($ngram{$ngram} / $total) * 100, 1, 1);
    printf "%3s: %4s (%4s%%)\n", $ngram, format_number($ngram{$ngram}), $percentage;
}
print "\n";

### Build an RE based on the ngrams.
my $ra = Regexp::Assemble->new;
$ra->add($_) for keys %ngram;
print $ra->re, "\n";

### Files must match these to be considered temporary.
my @REs = (
    ### Lower/upper case letters not in the extension.
    qr/\A[^.]+[a-z]/,
    qr/\A[^.]+[A-Z]/,
    ### Digit.
    qr/\d/,
    ### Name only contains upper/lower case letters or digits; ext. optional.
    qr/\A[a-zA-Z\d]+(?:\.[a-zA-Z]{1,4})?\z/,
);

File::Find::Rule
    ->file
    ->exec( sub {
        my $file = $_;
        ### Test for REs, words, then ngrams.
        return unless all { $file =~ $_ } @REs;
        for ($file =~ /([A-Za-z][a-z]+|[A-Z]+)/g) {
            if (exists $dict{lc $_}) {
                print "\tSkipping '$file' due to presence of '$_'\n";
                return;
            }
        }
        ### Parenthesize lc: "lc $file =~ ..." binds as lc($file =~ ...),
        ### which matches the original-case name against lowercase trigrams.
        return if lc($file) =~ $ra->re;
        print "$file\n";
    } )
    ->in(qw(/data /tmp));

Re: Finding Temporary Files
by graff (Chancellor) on Jan 08, 2011 at 03:00 UTC
    Just a general comment: Since you're trying to create a list that maximizes the ease and efficiency of manual review, it would make more sense to do a suitable rank-sorting of the list, rather than categorization -- e.g. files most likely to be temporary (with file names that are not generated by humans) should dominate the top of the list. Ngram statistics would be a natural basis for ranking file names according to the likelihood that they are temp files.

    To build a suitable "background" ngram model, it might be good to supplement (or replace) your dictionary with a "corpus" of non-temp-file names. For example, if you take all the file names that include punctuation (e.g. [-_+=. :]), split on punctuation, and count trigrams within chunks of 3 or more alphanumerics, you should have a more "realistic" set of probabilities for trigrams that make up non-temp file names.

    Then it's just a matter of assigning a score to each file name in a given list (update: i.e. of file names that have no punctuation), such that names using a lot of improbable trigrams score very low, and those comprising mostly plausible (likely, frequent) trigrams score very high. Sort the list by score (lowest first), and files that come out on top are most likely to be the easiest for human judges to dismiss as obvious temp files.
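    A minimal, self-contained sketch of that scoring step (the "trusted" names and the 0.5 floor for unseen trigrams are stand-in assumptions for a real corpus of punctuated file names):

```perl
use strict;
use warnings;

# Build a background trigram model from "trusted" names. These
# hard-coded names stand in for a real corpus of punctuated names.
my @trusted = qw(monthly_report sales_summary old_backup test_data);
my (%count, $total);
for my $chunk (grep { length >= 3 } map { split /[^A-Za-z0-9]+/, lc } @trusted) {
    for my $i (0 .. length($chunk) - 3) {
        ++$count{ substr $chunk, $i, 3 };
        ++$total;
    }
}

# Average log-probability of a name's trigrams; unseen trigrams get
# a small floor so one surprise doesn't dominate the whole score.
sub score {
    my $name = lc shift;
    my ($sum, $n) = (0, 0);
    for my $i (0 .. length($name) - 3) {
        my $p = ($count{ substr $name, $i, 3 } || 0.5) / $total;
        $sum += log $p;
        ++$n;
    }
    return $n ? $sum / $n : 0;
}

# Most improbable (most temp-like) names sort first.
my @ranked = sort { score($a) <=> score($b) } qw(yacb4yGI6p reportdata);
print "$_\n" for @ranked;
```

    With a real corpus the floor constant and any length normalization would want tuning, but the shape of the computation is the same.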

    And then it's just a matter of the judges deciding how far down the list they need to go in order to "finish" (because they've already found enough temp files to free up adequate space, or because they reach a point where there are too few temp files left to bother with).

    Of course, I'd be tempted to include file size in the sorting somehow -- deleting bigger temp files first would be a big help. But I don't know how well that would apply to your case.

Re: Finding Temporary Files
by flexvault (Monsignor) on Jan 08, 2011 at 16:47 UTC

    While not a direct reply to your code, I have to comment on the process you're using.

    I'm working on Unix ... I run a program that creates a report of suspect "temporary" files ... older than 2 years
    —and e-mail it to the department to review

    First, you are using the right tool - perl.

    But, Unix/Linux is so much more stable than in the early years . . . and 80% can go to 100% in seconds.

    My first use of perl was to purge temporary 'ftp' files that were left around in odd places. Waiting 2 years seems way too long.

    I just took a look at a Unix email server that gets over 12 incoming emails per second (mostly spam), and our temporary filesystems never had more than 1000 temporary files at any one time. Looking at a Linux web server, it had a lot more temporary files, but expanded and contracted on activity.

    ( Note: This isn't to say that if we had hardware/software problems, things wouldn't change quickly! )

    But the norm is to purge temporary files immediately, and then each morning at 03:44 every server runs a clean-up Perl cron job to delete any temp files that were not deleted by the production software. All production software is recycled. This process usually takes less than 15 seconds.
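    A hedged sketch of that nightly clean-up job, using only core File::Find (the scratch directory name and the 1-day cutoff are assumptions; adjust to whatever your production software actually uses):

```perl
use strict;
use warnings;
use File::Find;

# Delete files older than $max_age_days under the given scratch
# directories; returns the paths it removed so a cron wrapper can
# log them. Nonexistent directories are silently skipped.
sub purge_old {
    my ($max_age_days, @dirs) = @_;
    @dirs = grep { -d } @dirs or return;
    my @purged;
    find(sub {
        return unless -f $_;
        return unless -M $_ > $max_age_days;   # modified too long ago
        if (unlink $_) {
            push @purged, $File::Find::name;
        }
        else {
            warn "cannot unlink $File::Find::name: $!";
        }
    }, @dirs);
    return @purged;
}

# '/tmp/app_scratch' is a hypothetical scratch area.
print "purged $_\n" for purge_old(1, '/tmp/app_scratch');
```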

    The point I'm trying to make is that you may be doing yourself and your organization a favor by identifying code that is not cleaning up after itself. Your time is valuable to your organization, and I doubt that "department review" will do a better job than you in identifying true temporary files.

    On the other hand, if you inherited this process, and management requires it to be done this way, then just . . .

    Good Luck

    "Well done is better than well said." - Benjamin Franklin

      flexvault,

      Understood and agreed.

      There are clean-up routines already in place and this is an addendum. This server is strictly internal, so we do not encounter issues related to hosting, e-mail, spam, FTP, etc. However, we do run a plethora of programs and special applications. Unfortunately, in busy times dictated by "just get it done", not every process is written, run, or tested to par.

      Most of the clean-up is handled by other routines, so if the box does reach the 80% mark it's at the point where humans must intervene because (1) there's old junk hanging around that cannot be dumped automatically (or isn't being found by the cleaner-uppers), or (2) our work has actually ballooned to the point where we truly need more space. Ergo, this program is aimed at aiding in that decision: remove, archive, and/or expand?

Re: Finding Temporary Files
by Limbic~Region (Chancellor) on Jan 14, 2011 at 19:05 UTC
    eff_i_g,
    Here are some things you might want to consider. First, when I am creating a temporary file it is almost always called foo (foo.pl, foo.csv, etc.). You might want to add names like foo/bar/blah/asdf to your list of candidates. Also, I often create a directory called backup or archive where I stick files. You should consider that all the normally named files in a directory might be temporary solely because of the directory they are in. I have also adopted a convention of appending a number or a date to a file when I want to keep a few versions around (some_utility.3, some_utility.pl.3, or some_utility.2010-12-31). You may also want to consider using a checksum to determine whether there are any truly duplicate files, regardless of name.
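    The checksum idea can be sketched with core modules only (Digest::MD5 and File::Find ship with Perl; '/data' below is a stand-in for your shares):

```perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Group files under the given directories by MD5 digest and return
# only the groups with more than one member -- true duplicates,
# regardless of name. On big shares, grouping by size first (not
# shown) would avoid hashing every file.
sub find_dupes {
    my @dirs = grep { -d } @_ or return;
    my %by_sum;
    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;   # skip unreadable files
        binmode $fh;
        push @{ $by_sum{ Digest::MD5->new->addfile($fh)->hexdigest } },
            $File::Find::name;
    }, @dirs);
    return grep { @$_ > 1 } values %by_sum;
}

# '/data' is a hypothetical share root.
for my $group (find_dupes('/data')) {
    print join("\t", @$group), "\n";
}
```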

    As for identifying the truly temporary files: all 3 of your examples are exactly 10 characters long. I am not sure whether that is a coincidence, but if not, it should be easy to write a more robust noise detector that is applied only to names that are exactly 10 characters long and contain no period.
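    That narrower detector might look like this (a sketch keyed to the shape of the three example names: 10 alphanumerics, mixed case, at least one digit, no period):

```perl
use strict;
use warnings;

# Flag names that are exactly 10 alphanumerics mixing lower case,
# upper case, and digits -- the shape of mkstemp-style names like
# yacb4yGI6p. Any period (i.e. an extension) disqualifies the name.
sub looks_like_temp {
    my $name = shift;
    return $name =~ /\A(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z0-9]{10}\z/ ? 1 : 0;
}

for my $name (qw(yacb4yGI6p YQGsCV6Rbx SL8qEfnFDQ reportdata report2010)) {
    printf "%-10s => %s\n", $name, looks_like_temp($name) ? 'suspect' : 'keep';
}
```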

    Cheers - L~R

      L~R,

      Thanks for your input. I've attached the latest and greatest which includes:

      1. Logging and reporting
      2. Share selection with capacities
      3. Updated RE's
      4. A union of Solaris' and WordNet's dictionaries
      5. A list of suspects to exclude from the dictionary
      6. A find that includes directories

      Some aspects are customized to our environment (shares, RE's, and a few tidbits), but overall I'm pleased with what I have so far and I think it's easy enough to expand. It takes under an hour to scour ~1TB of shares and comes back with ~2,500 offenders totalling 25G. I just finished updating the script, so I still need to review the code for bugs and tweaks, but I've included it below nonetheless.

      For now I've forgone checksums for similarly named files because this should not be an issue for us. Also, the 10 character lengths were a coincidence—I'm looking for variable lengths.

      For me a run looks like this:
      The report like this:
      And, finally, the code:

Node Type: perlquestion [id://881164]
Approved by Argel
Front-paged by ww