oldfiles

by neilwatson (Priest)
on Nov 14, 2002 at 19:18 UTC ( [id://212968] )
Category: Utility Scripts
Author/Contact Info Neil Watson
watson-wilson.ca
Description: Recover your disk space!

oldfiles searches a directory and its subdirectories for files over a certain age and size. A report is emailed to the owners of those files asking them to remove or archive them.

This is a first draft. Everyone is encouraged to make comments and suggestions.

Thank you.

#!/usr/bin/perl

use strict;
use warnings;
use File::Find;
use Time::Local;
use Getopt::Std;
use Mail::Sender;

my $sender = Mail::Sender->new();

#Neil H Watson on Tue Mar 26 13:34:29 EST 2002
#usage oldfiles -d <directory> -a <days old> <option>
#option -s <size in megabytes> reports only files larger than

my %opt= ( #set default options
    d => ".",
    a => 0, 
    s => 0, 
);
getopts("d:a:s:", \%opt); #: means string to follow

if ($opt{a} == 0){
    print <<"*END*";
usage oldfiles -d <directory> -a <days old> <option>
option -s <size in megabytes> reports only files larger than

Report of matching files will be emailed to the owners
*END*
    exit 1; #without an age there is nothing to do
}

my ($uid, %files, $size, @pw, %unames, $str);
my $host = `/bin/hostname`;
chomp $host; #strip trailing newline so the subject stays on one line
my $domain = "mydomain.com";
my $support = "support\@mydomain.com";
my $from = "support\@mydomain.com";

#get user ids and names
open PW, "<", "/etc/passwd" or die "Cannot open /etc/passwd: $!\n";
while (<PW>){ 
    @pw = split ":", $_;
    #unames key=id value is username
    $unames{$pw[2]} = $pw[0] || "nwatson";
    #print "id = $pw[2], name  = $pw[0]\n";
}
close PW;

find(\&wanted, $opt{d});

foreach $uid (keys %files) {

    #construct email; Open returns the object on success,
    #a negative error code on failure
    ref $sender->Open({
        smtp => 'mail', 
        to => "$unames{$uid}\@$domain",
        from => "$from",
        subject => "Disk cleanup needed on $host",
        headers => "Errors-To: postmaster\@$domain"})
        or die "Error: $Mail::Sender::Error\n";

    $sender->Body;
    $sender->SendLineEnc(<<"*END*");
The following files are listed as belonging to you and have not been accessed in some time.  If they are not needed please delete them.  If you require long term, static storage of these files please email $support.  Thank you.

Files:
*END*
    $str = sprintf "%11s%11s\tFile Name","Size(MB)","Age(Days)";
    $sender->SendLineEnc("$str");
     
    #cycle through file list of user and create
    #line in email body
    foreach my $file (@{$files{$uid}}) {

        $str = sprintf "%11s%11s\t$file->{name}", $file->{size}, $file->{age};
        $sender->SendLineEnc("$str");
    }
    $sender->Close;
}

sub wanted{ #checks for old files
    if (-r && -f){
        #get size of file (MB)
        $size=int((lstat($_))[7]/1000000);

        # check the age of file    
        if ($opt{a} < -A && $size > $opt{s}){
            $uid = (lstat($_))[4];

            #use a hash of an array of hashes :)
            #hash of uid, that points to an array
            #whose elements hold the hashes for
            #name, size, age

            push @{$files{$uid}}, {
                name => $File::Find::name,
                size => $size,
                age  => int(-A) };
        }
    }
}
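The age test in wanted() hinges on Perl's -A file test, which reports days since last access measured from script start time ($^T). A minimal sketch of its behaviour, using a temporary file and a backdated access time (the three-day value is just for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

# Backdate the access time by three days; mtime stays "now".
my $three_days_ago = time() - 3 * 86400;
utime $three_days_ago, time(), $file or die "utime: $!";

# -A = ($^T - atime) / 86400, so this should be ~3.0
printf "days since last access: %.1f\n", -A $file;
```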
Replies are listed 'Best First'.
Re: oldfiles
by graff (Chancellor) on Nov 15, 2002 at 04:10 UTC
    This is a great idea for an app. Thanks! Just one quibble / suggestion:

    The GNU "find" command should typically be available on most systems where this script would be useful (there's a windows port of "find" and now there's macos-X with a unix command-line interface available). When the file space to be searched is really big with lots of files and directories, the "find" utility will be at least 5 times faster than the File::Find module, and some users might really appreciate some means of using this when they happen to know it's available to them.
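    A hedged sketch of what delegating the scan to find(1) might look like from Perl (assumes GNU-compatible find on the PATH; the -atime/-size flags and file names here are illustrative):

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempdir);

    # Set up a sample directory with one file whose atime is 5 days old.
    my $dir = tempdir(CLEANUP => 1);
    open my $fh, '>', "$dir/stale.dat" or die $!;
    close $fh;
    utime time() - 5 * 86400, time(), "$dir/stale.dat" or die "utime: $!";

    # Let find(1) do the traversal; read back the matching paths.
    my @old = `find "$dir" -type f -atime +2 2>/dev/null`;
    chomp @old;
    print "$_\n" for @old;
    ```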

      While my observations have been far from scientific it seems to me that File::Find is faster than GNU find. It's much, much faster than the find that comes with Solaris 7.

      Neil Watson
      watson-wilson.ca

        A while back, on another SoPW thread, I made up a simple benchmark -- I'd be interested to find out if you get very different results on a comparable directory tree (if you feel so inclined...) Thanks.
Re: oldfiles
by zentara (Archbishop) on Nov 15, 2002 at 14:17 UTC
    Hi, just a minor thing. I downloaded your code and did a dos2unix conversion on it, and it gave an error near "have not" on line 65. It seems that you need a line feed after $sender->SendLineEnc(<<"*END*"); It wouldn't be so bad, but the line is way off the screen and can be confusing to find.
Re: oldfiles
by AltBlue (Chaplain) on Nov 15, 2002 at 16:17 UTC
    - first note on your code is that you blatantly waste CPU cycles with double-quote interpolation. heh, this is a habit seen in many people who came to Perl from other programming languages like C/C++, where a string must always be delimited by double quotes, but many other languages (and Perl is one of them) offer extended functionality through these semantic sugarcubes :)
    e.g.:
    "." -> '.'
    "*END*" -> '*END*'
    "support\@mydomain.com" -> 'support@mydomain.com'
    ... etc ...

    - check the Perl FAQ for a perlish way of getting your hostname
    - i'd rather consider using getpwuid and a caching system instead of pushing all user information into memory
    - run this script on a cron basis scanning user-writeable filesystems and you'll soon get headaches (considering, of course, that users are always b.a.d.) - they could create a heck of a lot of dumb files to make your script thrash your system (heh, my first thought was to tell you to switch to a more efficient storage method for the %files data structure)
    - presuming a MB equals '1000000' is wrong. indeed, HDD manufacturers inherited this darn habit, but it is not true of the filesystem that lies on that storage space ;-)
    - you again waste a lot of CPU cycles performing useless stats: the file passed to your wanted routine as $_ has already been stat-ed by the File::Find module, so you can rely on the information in _;
    - as a principle: restrict the scope of each variable as much as possible to avoid problems ;-] (i'm talking about most of the variables on this line: my ($uid, %files, $size, @pw, %unames, $str);)
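    The point about '1000000' not being a megabyte on the filesystem can be seen with a one-file-size example (the byte count is chosen only to show the divisors disagreeing):

    ```perl
    use strict;
    use warnings;

    my $bytes = 1_000_000;            # a file of exactly one million bytes
    my $si_mb = int($bytes / 10**6);  # 1 "MB" by the SI / HDD-marketing definition
    my $mib   = int($bytes / 2**20);  # 0 MiB, since 2**20 = 1_048_576 bytes
    printf "SI: %d, binary: %d\n", $si_mb, $mib;  # prints "SI: 1, binary: 0"
    ```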
     
    Here is a trimmed down version that implements the cache I told you about above, eliminates all useless stats and switches %files to a hash of arrays ;-]
    use Data::Dumper;
    use Cache::MemoryCache ();

    my $cache = new Cache::MemoryCache;

    find({wanted => \&ab_wanted, no_chdir => 1}, $opt{d});
    print Dumper \%files;

    sub ab_wanted {
        if (-r _ && -f _){
            my $size = int ((-s _) / 2**20);
            if ($size > $opt{s} && $opt{a} < -A _){
                my $uid = (lstat(_))[4];
                my $name = $cache->get($uid);
                unless (defined $name) {
                    $name = getpwuid($uid);
                    $cache->set($uid, $name);
                }
                push @{$files{$name}}, [ $_, $size, int -A _ ];
            }
        }
    }
    --
    AltBlue.
      I find parts of this hard to follow. What is the purpose of Cache::MemoryCache? What is being cached and how is that better than letting the OS handle RAM?

      How does the information get stored using a hash of arrays?

      What does the no_chdir => 1 mean?

      Neil Watson
      watson-wilson.ca

        What does the no_chdir => 1 mean?
        from File::Find manual:
        "no_chdir"
                  Does not "chdir()" to each directory as it recurses. The wanted()
                  function will need to be aware of this, of course. In this case, $_
                  will be the same as $File::Find::name.
        
        In other words, it just means that your program will skip the chdir part, which is useless for your program (so you again gain some CPU cycles) and, as a bonus, $File::Find::name equals $_, so you'll be able to use:
        push @{$files{$name}}, [ $_, $size, int -A _ ];
        instead of
        push @{$files{$name}}, [ $File::Find::name, $size, int -A _ ];
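        The claim that $_ equals $File::Find::name under no_chdir can be checked with a small self-contained script (the temporary directory is just scaffolding for the demonstration):

        ```perl
        use strict;
        use warnings;
        use File::Find;
        use File::Temp qw(tempdir);

        my $dir = tempdir(CLEANUP => 1);
        open my $fh, '>', "$dir/sample.txt" or die $!;
        close $fh;

        # With no_chdir => 1, File::Find never chdir()s, and $_ holds the
        # full path, identical to $File::Find::name, for every node visited.
        my @mismatches;
        find({ no_chdir => 1, wanted => sub {
            push @mismatches, $_ unless $_ eq $File::Find::name;
        }}, $dir);

        print @mismatches ? "differ\n" : "identical\n";  # prints "identical"
        ```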

        What is the purpose of Cache::MemoryCache?
        heh, i told you in my previous post to consider using getpwuid and a caching system instead of pushing all user information into memory. Cache::MemoryCache is just a well-known module that could be useful for such things. OFC, in your specific case this could be overkill, as your program is quite simple and maybe a simple hash cache would be enough. Anyway, I was trying to prepare you for more complicated stuff ;-)
        What is being cached? The correspondence between UserIDs and UserNames.
        Let's see an example using a simple hash-cache.
        my %cache;

        find({wanted => \&ab_wanted, no_chdir => 1}, $opt{d});

        use Data::Dumper;
        print Dumper \%files, \%cache;

        sub ab_wanted {
            if (-r _ && -f _){
                $size = int ((-s _) / 2**20);
                if ($size > $opt{s} && $opt{a} < -A _){
                    $uid = (lstat(_))[4];
                    my $name = $cache{$uid};
                    unless (defined $name) {
                        $name = getpwuid($uid);
                        $cache{$uid} = $name;
                    }
                    push @{$files{$name}}, [ $_, $size, int -A _ ];
                }
            }
        }

        --
        AltBlue.
