PerlMonks  

faster filesystem stats

by Anonymous Monk
on Jul 12, 2001 at 18:28 UTC ( [id://96044]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi.. I'm trying to get file stats off various networked data volumes, and I've been using opendir and readdir. However, it takes ages for anything not on the current segment. Someone has suggested using File::Find instead, but the (to my eyes) meagre documentation doesn't give me any clues. Could someone point me in the right direction? I've included the main snippets of rough code so you can see what I'm trying to do. (P.S. this is for NT/Netware systems, so cdate actually gives me creation date.)

many thanks
Phil DG
Main code here
.
.
.
pathstat($pathname, $file_count, $dir_count, $total_size,
         $aged_file_count, $aged_total_size);

print "\n\nFile Statistics for $pathname\n".
      "\ntotal file count is $file_count\n".
      "total dir count is $dir_count\n".
      "total size is $total_size\n\n".
      "number of files between $lowrange and $highrange days old is $aged_file_count\n".
      "total size of aged files is $aged_total_size\n";
.
.
.
sub pathstat {
    my($arg_pathname) = $_[0];
    my($dir_entry);
    my($dir_handle) = "BIN" . $_[2];
    my($file_age);
    my($file_size);

    opendir($dir_handle, $arg_pathname) or die "Can't open $arg_pathname: $!";
    while (defined($dir_entry = readdir $dir_handle)) {
        if (-d $arg_pathname . "\\" . $dir_entry) {
            if ($dir_entry ne "." && $dir_entry ne "..") {
                ++$_[2];
                pathstat($arg_pathname . "\\" . $dir_entry,
                         $_[1], $_[2], $_[3], $_[4], $_[5]);
            }
        }
        else {
            ++$_[1];
            $file_size = (-s $arg_pathname . "\\" . $dir_entry);
            $_[3] += $file_size;
            $file_age = (-C $arg_pathname . "\\" . $dir_entry);
            if ($file_age >= $lowrange && $file_age <= $highrange) {
                ++$_[4];
                $_[5] += $file_size;
            }
        }
    }
    closedir($dir_handle);
}

Replies are listed 'Best First'.
Re: faster filesystem stats
by OzzyOsbourne (Chaplain) on Jul 12, 2001 at 19:07 UTC

    I run File::Find in a main script, then run a secondary script that reads the resulting logs and stats each file for its size:

    use strict;

    my ($type, $server,$out,$in,@input,$total,$kbytes,$mbytes);
    my @servers=('Server');
    my $dir1='//machine/share';
    my @types=('swf','asf','avi','mp2','mp3','mpg','mpga','mpe','mpeg','wav',
               'mov','qt','mid','midi','ra','ram','rmi','rmj','rmx','zip','exe',
               'wm','wma');

    foreach $type (@types){
        $total=0;
        my $out="$dir1/sifted/$type\.txt";
        my $out2="$dir1/sifted/$type-ok\.txt";
        open OUT, ">$out" or die "Cannot open $out for write :$!";
        foreach $server (@servers){
            # one log of file names per server, written by the main script
            $in="$dir1/$server\.txt";
            open IN,"$in" or next;
            @input=<IN>;
            chomp @input;
            foreach (@input){
                # only stat the files whose extension matches
                if (/\.$type$/i){
                    $kbytes = (stat)[7]/1024;
                    $total+=$kbytes;
                    print OUT "$_\t$kbytes KB\n";
                }
            }
            close IN;
        }
        $mbytes=$total/1024;
        print OUT "\n\nTotal: $mbytes MB\n";
        close OUT;
        # nothing of this type found: mark the report as ok
        if ($mbytes == 0){
            rename $out, $out2;
        }
        print "Finished $type...\n";
    }

    The main script takes 8 hours for 40 servers with gigs and gigs per server. The second script takes 15 minutes to run through. Defining the types of files actually speeds up the process, as the stat is the part that really slows the code down ($kbytes = (stat)[7]/1024; is the file size portion). You could combine these into something you can use...

    -OzzyOsbourne

(tye)Re: faster filesystem stats
by tye (Sage) on Jul 13, 2001 at 05:44 UTC

    Using File::Find will likely be faster simply because File::Find chdir()s into each directory as it recurses so that you are doing things like stat("file.txt") instead of stat("root/subdir/subsubdir/file.txt") which has to at least parse that path every time and probably traverse each of the directories mentioned each time.

    Another way to make your code faster is to use the special stat target of _ which lets you get more data about the same file without making Perl call stat over and over.
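
    As a rough sketch of that trick (the helper name file_facts is made up for illustration, and "." is used only because it always exists): one explicit stat fills Perl's stat buffer, and the file tests that follow read from "_" instead of going back to the filesystem.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (is_dir, size_in_bytes, age_in_days) for one
# path while asking the filesystem about it only once.
sub file_facts {
    my ($path) = @_;
    my @st = stat($path)            # the one real stat call
        or return;                  # empty list if the path cannot be stat'ed
    my $is_dir = -d _ ? 1 : 0;      # "_" reuses the buffer filled by stat() above
    my $size   = (-s _) || 0;       # -s is false for empty files, so default to 0
    my $age    = -C _;              # days between ctime and script start
    return ($is_dir, $size, $age);
}

my ($is_dir, $size, $age) = file_facts(".");
printf "dir=%d size=%d age=%.4f days\n", $is_dir, $size, $age;
```

    In the original pathstat the three tests (-d, -s, -C) each take a full path, so each one is a separate stat; written this way it is one stat per entry.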

    The trick with File::Find is how to share the variables between related calls to your "wanted" subroutine while not sharing them between unrelated calls to your "wanted" subroutine.

    You could do something very similar to what you have above with:

    find(
        sub {
            filestat(
                "ignored", $file_count, $dir_count, $total_size,
                $aged_file_count, $aged_total_size
            );
        },
        $pathname
    );

    and then rip out most of your "pathstat" and rename it "filestat":

    sub filestat {
        if (-d $_) {
            if ($_ ne "." && $_ ne "..") {
                ++$_[2];
            }
        }
        else {
            ++$_[1];
            $_[3] += -s _;
            my $file_age= (-C _);
            if ($file_age >= $lowrange && $file_age <= $highrange) {
                ++$_[4];
                $_[5] += -s _;
            }
        }
    }

    but it is possible to clean that up much more.
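
    One possible cleanup, sketched here under assumptions rather than offered as the definitive version: let the "wanted" sub be a closure over a hash of counters. That also deals with the sharing problem mentioned earlier, since each call to the (hypothetical) tree_stats below gets its own hash. The function name and the hash keys are invented for this example; $low and $high stand in for $lowrange and $highrange.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $root and return a hash ref of totals.
# $low and $high bound the file age in days, as in the original question.
sub tree_stats {
    my ($root, $low, $high) = @_;
    my %t = (files => 0, dirs => 0, size => 0,
             aged_files => 0, aged_size => 0);

    # The anonymous sub closes over %t, so every call made during this
    # find() shares the same counters, and a later find() starts fresh.
    find(sub {
        my @st = stat($_) or return;       # one stat per entry; skip unreadable ones
        if (-d _) {
            ++$t{dirs} unless $_ eq '.';   # '.' is the starting directory itself
            return;
        }
        my $size = (-s _) || 0;            # "_" reuses the stat buffer
        my $age  = -C _;
        ++$t{files};
        $t{size} += $size;
        if ($age >= $low && $age <= $high) {
            ++$t{aged_files};
            $t{aged_size} += $size;
        }
    }, $root);

    return \%t;
}
```

    Usage would then be something like my $t = tree_stats($pathname, $lowrange, $highrange); followed by printing $t->{files}, $t->{dirs} and so on.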

    If going for maximal speed, I'd probably make that code a bit easier to read and maintain by using symbolic constants instead of literal 1 through 5:

    sub iFileCount()     { 0; }
    sub iDirCount()      { 1; }
    sub iTotalSize()     { 2; }
    sub iAgedFileCount() { 3; }
    sub iAgedTotalSize() { 4; }

    find(
        sub {
            filestat(
                $file_count, $dir_count, $total_size,
                $aged_file_count, $aged_total_size
            );
        },
        $pathname
    );

    sub filestat {
        my($file_age);
        my($file_size);
        if (-d $_) {
            if ($_ ne "." && $_ ne "..") {
                ++$_[iDirCount];
            }
        }
        else {
            ++$_[iFileCount];
            $file_size = (-s _);
            $_[iTotalSize] += $file_size;
            $file_age = (-C _);
            if ($file_age >= $lowrange && $file_age <= $highrange) {
                ++$_[iAgedFileCount];
                $_[iAgedTotalSize] += $file_size;
            }
        }
    }

    You could also consider using File::Recurse which has some niceties over File::Find [ but maybe isn't being maintained anymore? ): ].

    You could probably make your own code faster even than File::Find code by reworking it to use chdir (and the "-x _" trick). File::Find will often have to stat a file itself, but your "wanted" routine can't tell when File::Find has already done so, so you have to stat each file again, and you and File::Find both end up stat'ing the same files much of the time.
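
    A minimal sketch of what such a rework might look like, assuming nothing beyond core Perl. The names walk and walk_tree and the hash keys are invented for illustration, and whether this really beats File::Find on a given network volume would have to be measured.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);

# Hypothetical recursive walker: chdir into each directory so every stat
# uses a bare name, and stat each entry exactly once.
sub walk {
    my ($dir, $t) = @_;                   # $t is a hash ref of running totals
    chdir $dir or return;                 # skip directories we cannot enter
    if (opendir my $dh, '.') {
        my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;                     # close before recursing
        for my $name (@names) {
            lstat $name or next;          # the only stat for this entry;
                                          # lstat so symlinked dirs are not followed
            if (-d _) {
                ++$t->{dirs};
                walk($name, $t);          # comes back with cwd in this directory
            }
            else {
                my $size = (-s _) || 0;   # "_" reuses the lstat buffer
                my $age  = -C _;
                ++$t->{files};
                $t->{size} += $size;
                if ($age >= $t->{low} && $age <= $t->{high}) {
                    ++$t->{aged_files};
                    $t->{aged_size} += $size;
                }
            }
        }
    }
    chdir '..' or die "Cannot return from $dir: $!";
}

# Hypothetical wrapper: remembers the starting directory and restores it.
sub walk_tree {
    my ($root, $low, $high) = @_;
    my %t = (low => $low, high => $high, files => 0, dirs => 0,
             size => 0, aged_files => 0, aged_size => 0);
    my $start = getcwd();
    walk($root, \%t);
    chdir $start or die "Cannot return to $start: $!";
    return \%t;
}
```

    The directory handle is closed before recursing so that only one handle is open at a time, which also removes the need for the "BIN" . $_[2] handle names in the original pathstat.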

            - tye (but my friends call me "Tye")
