http://qs321.pair.com?node_id=320779

ibanix has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I've got a Windows2K file server with over 2 million pdf files. We want to delete any of these pdfs more than X days old (X seems to keep changing). So, I wrote up a script (below) to automate this; I'm using ActivePerl 5.6.

My problem is that the script seems to cause the system to eat up all of its "system cache" memory! The perl process itself only seems to use a nominal amount of memory. I blame Windows 2000, but has anyone else seen this problem? Does anything in my script stand out as bad?

### Delete pdf files older than X days from given directory tree ###

use strict;
use warnings;
use File::Find;

my @directories;
$directories[0] = $ARGV[0];
my $days_old = $ARGV[1];

finddepth(\&wanted, @directories);

# File::Find coderef
# Find all PDFs older than given # of days & delete
sub wanted {
    # Turn forward slashes into backslashes to make real Win32 paths
    my $file = $File::Find::name;
    $file =~ s|/|\\|g;

    if ( ($file =~ m|\.pdf$|) && (int(-M $file) > $days_old) ) {
        print "Found: $file, deleted\n";
        unlink($file) || print "Unable to delete $file!\n";
    }
}


Thanks in advance,
ibanix

$ echo '$0 & $0 &' > foo; chmod a+x foo; foo;

Replies are listed 'Best First'.
Re: Some File::Find woes. ($_ not $name)
by tye (Sage) on Jan 12, 2004 at 21:50 UTC

    You are doing this very inefficiently. File::Find goes to the trouble of chdir'ing into the directory that holds the files you are looking at, and then you do operations on a long path string [1], which means the operating system needs to reparse the path and pull information about all of the parent directories out of (and perhaps into) the cache.

    First, set $File::Find::dont_use_nlink= 1; (to make your script portable -- it won't be using this on Win32 since Win32 file systems just don't work that way), then use -M _ instead of your current -M $file, unlink($_) instead of unlink($file), and (much less importantly) use $_ =~ m|\.pdf$| (and you can drop the $_ =~ part if you so desire).

    Note that I said -M _ and not -M $_, the former being a bit more efficient because $File::Find::dont_use_nlink= 1; assures that File::Find has already done an lstat on $_ for you so you don't need to do it again.

                    - tye

    [1] ...which might even be incorrect if $ARGV[0] didn't contain an absolute path name. I'd have to reread the File::Find docs to determine that, but what you do to fix this is the same regardless, so I won't.

    (updated: trivial)
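
    Putting those changes together, the wanted() callback might look something like this (an untested sketch; the command-line handling is unchanged from the original script):

    use strict;
    use warnings;
    use File::Find;

    # Not used on Win32 anyway, but keeps the script portable and guarantees
    # File::Find has already lstat'd each entry before calling wanted().
    $File::Find::dont_use_nlink = 1;

    my @directories = ($ARGV[0]);
    my $days_old    = $ARGV[1];

    finddepth(\&wanted, @directories);

    sub wanted {
        # File::Find has already chdir'd into the containing directory, so test
        # the short name in $_ and reuse the stat buffer via "_" instead of
        # handing the OS a long path to reparse.
        return unless m|\.pdf$| && int(-M _) > $days_old;
        print "Found: $File::Find::name, deleting\n";
        unlink($_) or print "Unable to delete $File::Find::name!\n";
    }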

      Thank you, this solved the system cache problem. I also took the other posts to heart and may try those solutions too.

      Thanks all!

      $ echo '$0 & $0 &' > foo; chmod a+x foo; foo;
Re: Some File::Find woes.
by bluto (Curate) on Jan 12, 2004 at 23:50 UTC
    In addition to tye's suggestions, you may also want to make sure you are actually removing a file by checking its type (i.e. with the '-f' operator), unless you know for sure you won't have directories ending in '.pdf'.
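
    A guard near the top of the wanted() sub would cover that; a minimal sketch (untested):

        return unless -f $_;   # only plain files; skips any directory that happens to end in .pdf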

    FWIW, I bet 2 million files take a while to search through, and I'm assuming the tree can change quite a bit during that time. You may see interesting errors as File::Find tries to cope with moved or renamed subtrees and files.

    Perhaps W2K has some built-in and/or scriptable method of locating files more quickly than searching the entire namespace one entry at a time?

    bluto

Re: Some File::Find woes.
by Tommy (Chaplain) on Jan 13, 2004 at 00:52 UTC

    You don't need to use File::Find for this at all. Further, you don't need to use absolute paths, and you don't need to convert "/" to "\": Windows has understood "/" in file paths for a long time now.

    #!/usr/bin/perl -w
    # this code only partially tested

    my($dir)    = $ARGV[0] || '';
    my($maxage) = $ARGV[1] || 0;

    $maxage = int $maxage;

    die "Can't operate without a target directory spec. Op aborted.\n"
        unless length $dir;    # dir could be named "0"
    die qq[No such directory "$dir"\n] unless -e $dir;
    die "Need max allowed age spec for files. Op aborted.\n" unless $maxage;
        # zero values and non-numeric values not accepted.

    local *PDFDIR;
    opendir(PDFDIR, $dir) or die $!;
    my(@files) = grep(/\.pdf$/, readdir(PDFDIR));
    closedir(PDFDIR) or warn $!;

    print qq[Nothing to do. No files present in "$dir".\n] and exit
        unless @files;

    foreach (@files) {
        my $path = qq[$dir/$_];
        if (-d $path) {
            print qq[Skipping directory "$path"\n];
            next;
        }
        next unless int(-M $path) > $maxage;
        unlink $path or die qq[Can't unlink "$path"! $!];
        print qq[Deleted "$path"\n];
    }

    print "Done.\n\n" and exit;
    --
    Tommy Butler, a.k.a. TOMMY
    
      If you plan a non-File::Find approach, you'd better make your procedure a recursive one.
      my($dir)    = $ARGV[0] || '';
      my($maxage) = $ARGV[1] || 0;

      $maxage = int $maxage;

      die "Can't operate without a target directory spec. Op aborted.\n"
          unless length $dir;    # dir could be named "0"
      die qq[No such directory "$dir"\n] unless -e $dir;
      die "Need max allowed age spec for files. Op aborted.\n" unless $maxage;
          # zero values and non-numeric values not accepted.

      sub gothere {
          my ($ddir) = @_;
          $ddir =~ s|$|/|;
          local *PDFDIR;
          opendir(PDFDIR, $ddir) or die $!;
          my(@files) = map { $ddir . $_ } readdir(PDFDIR);
          closedir(PDFDIR) or warn $!;
          print qq[Nothing to do. No files present in "$ddir".\n] and exit
              unless @files;
          foreach my $f (@files) {
              next if $f =~ m|/\.\.?$|;    # don't want . ..
              gothere($f) if -d $f;
              if ((int(-M $f) > $maxage) && (-f _) && $f =~ /\.pdf$/i) {
                  unlink $f or die qq[Can't unlink "$f"! $!];
                  print qq[Deleted "$f"\n];
              }
          }
      }

      gothere($dir);
      print "Done.\n\n" and exit;

      Warning: untested.
      --
      dominix

      Adding a note, of course, that you are assuming the directories they want to search do not contain sub-directories that need searching - though, with this many files, that is very likely to be the case.

      .02

      cLive ;-)

Re: Some File::Find woes.
by paulbort (Hermit) on Jan 13, 2004 at 18:39 UTC
    Of course it's eating all of your system cache: you're trawling through TWO MILLION files. The comments suggesting reading the directories instead of the files are excellent; I only have a couple of minor tidbits to add:

    - If you have a way to hook into the process that creates the PDF, you could write to a one-table database, or even a text file, listing the file and when you want it to expire. Then the cleanup program only has to scan the list for files past their expiration date, and delete them. (You could just about do that in a batch file.)

    - If you can change the structure of where the files are stored, put them in directories named based on when the files should go away, like '2004-01-20', and delete any directories older than today.
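
    For the second approach the cleanup pass could be as small as this sketch (untested; assumes a hypothetical root of C:/pdfroot whose subdirectories are all named YYYY-MM-DD):

    use strict;
    use warnings;
    use File::Path qw(rmtree);
    use POSIX qw(strftime);

    my $root  = 'C:/pdfroot';                    # hypothetical root holding the dated directories
    my $today = strftime('%Y-%m-%d', localtime); # e.g. 2004-01-20

    opendir(my $dh, $root) or die "Can't read $root: $!";
    for my $dir (readdir $dh) {
        next unless $dir =~ /^\d{4}-\d{2}-\d{2}$/;   # only the date-named directories
        next unless -d "$root/$dir";
        next if $dir ge $today;                      # ISO dates compare correctly as strings
        rmtree("$root/$dir") or warn qq[Couldn't remove "$root/$dir"\n];
    }
    closedir $dh;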


    --
    Spring: Forces, Coiled Again!

      That's using your noggin. Great ideas.

      --
      Tommy Butler, a.k.a. TOMMY
      
Re: Some File::Find woes.
by scratch (Sexton) on Jan 13, 2004 at 18:01 UTC
    I'll suggest not doing this in Perl. I have to do a similar job periodically, and I've settled on a utility named 'delen', a powerful DOS delete utility. If you want to recursively delete all .pdf files more than, say, 30 days old, this will do it:

    delen c:\*.pdf [d-7000,-30] /s

    I'm pretty happy with it.

    Hello Cleveland!