Memory Management Problem

PrimeLord has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Memory Management Problem
by PodMaster (Abbot) on Nov 20, 2003 at 21:37 UTC

perldoc AnyDBM_File

perldoc DB_File

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

[reply]

Re: Memory Management Problem
by Roger (Parson) on Nov 21, 2003 at 00:07 UTC

#!/usr/local/bin/perl -w
use strict;
use DBI;
use DBD::Sybase;
use File::Find;

my $dbh = ....;  # connect to database
my $sth = $dbh->prepare("select filename from bench where filename=?")
+;

# look for new files
my @new_files;
find( { follow => 1, no_chdir => 1,
        wanted =>
          sub {
            if (! /\.$/) {            # ignore unwanted . or ..
              $sth->execute($_);
              my $file_exists;
              while (my @res = $sth->fetchrow_array()) { $file_exists+
++ }
              push @new_files if !$file_exits;
            }
          }
      }, '/');
$sth->finish;

# insert new files into the database
$sth = $dbh->prepare("insert into bench (filename) values (?)");
$sth->execute($_) for @new_files;
$sth->finish;

# do stuff with @new_files ....
[download]

#!/usr/local/bin/perl -w
use strict;
use File::Find;

my $bench_file = 'bench.txt';

my @new_files;
find( { follow => 1,  no_chdir => 1,
        wanted => \&callback }, '/');

sub callback {
  if (! /\.$/) {            # ignore unwanted . or ..
    if ( ! `grep '$_' $bench_file` ) {
      push @new_files, $_;  # remember this file
    }
  }
}

# append unseen filenames to bench.txt file
open BENCH, ">>bench.txt"
     or die "Can not append to bench.txt";
print BENCH "$_\n" foreach (@new_files);
close BENCH;
[download]

[reply]
[d/l]
[select]

Re: Memory Management Problem
by Zaxo (Archbishop) on Nov 20, 2003 at 21:40 UTC

How about dumping the new report to a file, then generating the comparison of the two reports line-by-line? That will remove all but the wildest memory constraints. You can rename files after you're done with all that.

After Compline,
Zaxo

[reply]

Re: Memory Management Problem
by duff (Parson) on Nov 20, 2003 at 21:51 UTC

Depending on the actual content of your report, you may want to push some of the work you're currently doing in perl out to the find command. You could use the -cnewer option to find to get a list of files that are newer than the timestamp of some other file. So, after each scan, you touch a special file (maybe that's your benchmark_file) and then use that next time to find out which are newer. man find.

Using that information (and some other standard unix utilities) you should be able to generate a file with all of the files that were there yesterday and another file with all of the files that are new today. And then just read those two files a line at a time for your report (again, depending on the exact output of your report).

Just some ideas ...

duff

[reply]

Re: Memory Management Problem
by swngnmonk (Pilgrim) on Nov 21, 2003 at 07:02 UTC

Prime,

Can you provide a little more information about the contents of the Bench file, and what print_report() is doing?

From what I gather, the bench file is simply a list of absolute file paths on the filesystem (since you're using a find call to populate %today). What exactly are you trying to track?

Another question - have you confirmed your find command on your machine? On my box (redhat 9), that call to find (assuming $search_files is a scaler for a text match of some kind) would return every file on the filesystem. Are you sure you're getting the correct results?

Now that I think about it, I've got an idea on a general approach, assuming you've got access to the standard Unix utils - use sort, uniq, and diff, and parse the output of the diff. e.g.

`cat benchmark_files|sort|uniq -c > benchmark_counted`;
`find / $search_files -print |sort | uniq -c > todays_find`;

open IN, "diff benchmark_counted todays_find|" or die "$!";

while (<IN>) {

   ## parse diff output into %yesterday and %today
   ## an exercise for the reader

}
close IN;
[download]

By using the unix tools, you've now got the same output as you had after the call to _scan_system(). Note - diff will flag identical lines with different counts (that's what the -c option to uniq does) - you'd have to account for that when parsing the diff output.

This assumes, of course, that the real memory hog is %yesterday, before a pile of keys are deleted in building %today. If I'm wrong, and at the end of processing %yesterday and %today are both too big to handle by print_report(), you may well need to look at some kind of BerkeleyDB-type solution, but realize it's going to slow things down by a lot.

I hope this helps - sort/diff/uniq can be a great way to reduce the load on perl when processing large files.

[reply]
[d/l]

Re: Re: Memory Management Problem

by thospel (Hermit) on Nov 21, 2003 at 19:51 UTC

useless use of cat award

[reply]

Re: Re: Re: Memory Management Problem

by swngnmonk (Pilgrim) on Nov 21, 2003 at 20:11 UTC

Doh. Point made. :)

let that line read:

`sort benchmark_files | uniq -c > benchmark_counted`;
[download]

And all can be right in the world.

[reply]
[d/l]

Re^4: Memory Management Problem

by thospel (Hermit) on Nov 22, 2003 at 16:21 UTC

Re: Memory Management Problem
by thospel (Hermit) on Nov 21, 2003 at 19:35 UTC

Basically it seems like your files represent sets, and order isn't relevant. Comparing two big sets is easiest if both sets are sorted since you can then simply keep an active pointer in each sorted sequence and progress them in tandem.

Remains the question of how to sort the sets. One way is to use unix sort, which normally will not load a big file completely in memory. So that idea leads to code like:

# warning: untested code

# A string that will sort beyond any returned file (they all start wit
+h /)
use constant INFINITY => chr(ord("/")+1);

open(local *YESTERDAY, "<", $yesterday_file) || 
    die "Could not open $yesterday_file: $!";
open(local *CURRENT, "find / $search_files -print | sort") ||
    die "Could not start find: $!";
open(local *TODAY, ">", $today_file) ||
    die "Could not create $today_file: $!";
my $yesterday = <YESTERDAY> || INFINITY;

local $_;
while (<CURRENT>) {
    print TODAY $_;
    while ($yesterday lt $_) {
        print "Lost file $yesterday";
        $yesterday = <YESTERDAY> || INFINITY;
    }
    # Now $yesterday ge $_
    if ($yesterday gt $_) {
        print "New file $_";
    } else {
        $yesterday = <YESTERDAY> || INFINITY;
    }
}
if ($yesterday ne INFINITY) {
    print "Lost file $yesterday";
    print "Lost file $_" while <YESTERDAY>;
}
[download]

Due to the sort it still has complexity O(n*log(n)) in the number of files. It would be nice if find had an option to walk the directories in lexical order, since then the sorting only needs to happen on the directory level, which very likely makes the logaritmic factor very low. Instead you could make perl do the find work. This causes you to miss out on many of the clever optimizations find style programs can do though, so this might not always be a gain (considering the amount of files you process it probably is though).

In perl you can do a directory walk using File::Find and you can even use find2perl to convert a find specification to equivalent perl code. But as a quick and dirty demo I'll show the code with a handrolled loop here where I list all names that aren't directories

# Again untested, so take care !
# A string that will sort beyond any returned file (they all start wit
+h /)
use constant INFINITY => chr(ord("/")+1);

my $yesterday;

sub walk_dir {
    # dir argument is assumed to already end on /
    my $dir = shift;
    opendir(local *DIR, $dir) || die "Could not opendir $dir: $!";
    for (sort readdir(DIR)) {
        next if $_ eq "." || $_ eq "..";
        my $f = "$dir$_";
        if (-d $f) {
            walk_dir("$f/");
        } else {
            $f .= "\n";
            print TODAY $f;
            while ($yesterday lt $f) {
                print "Lost file $yesterday";
                $yesterday = <YESTERDAY> || INFINITY;
            }
            # Now $yesterday ge $f
            if ($yesterday gt $f) {
                print "New file $f";
            } else {
                $yesterday = <YESTERDAY> || INFINITY;
            }
        }
    }
}

open(local *YESTERDAY, "<", $yesterday_file) ||
    die "Could not open $yesterday_file: $!";
open(local *TODAY, ">", $today_file) ||
    die "Could not create $today_file: $!";
$yesterday = <YESTERDAY> || INFINITY;

walk_dir("/");

if ($yesterday ne INFINITY) {
    print "Lost file $yesterday";
    local $_;
    print "Lost file $_" while <YESTERDAY>;
}
[download]

Update I forgot to stress that in this last solution there is no place anymore that would be expected to use a lot of memory (like e.g. a shell sort based one still would do). Real memory use will probably be only a few megabytes (I'm assuming no directory is huge).

It might in fact still be interesting to split up the task in two processes, one running a perl based find to generate the ordered list of files, and one to run the set difference, so that the diff style work can overlap in time with the directory scanning. This would allow you to do usefull work during the disk I/O wait periods.

[reply]
[d/l]
[select]


Your skill will accomplish what the force of many cannot
	PerlMonks