
Thanks, vr! It has really helped me a lot!

… I've re-written your code completely ...

OK, Good!

1) I rewrote my FS_sweep based on your proposal

my $d =  Win32::LongPath-> new;
…
$d-> readdirL;

Doing stress tests on C:, I ran into a number of problems. The script sometimes works, but often it gets stuck or loops and is difficult to kill. I had to use Ctrl-Alt-Del and the Task Manager to stop it. I suspect that memory is overwritten by $d->readdirL;. The largest directory I have found returned by $d->readdirL; has 38252 entries. MS File Explorer says: 38250 objects.

2) I rewrote my FS_sweep using `while ( my $name = $dir->readdirL() )`

Below follows a script which can be used to test this approach.

This works much better, but there are still problems. The script sometimes gets stuck (no CPU time is used) or loops (CPU time is used, but nothing happens). When the found file paths are logged to a file, the problem seems to occur less often. There is probably some kind of timing problem in readdirL. When accessing C:/Users the problem is rather frequent. Sweeping file structures less complicated than C: seems to be OK!

File paths like <C:/Users/bo/Application Data/À> are sometimes returned by readdirL()!?

Here are some results

dir: C:   #dirs: 79923 #files: 305816 #nodes: 385739  1/s: 5749
dir: D:   #dirs: 7776 #files: 115255 #nodes: 123031  1/s: 14499
dir: Q:   #dirs: 907 #files: 16095 #nodes: 17002  1/s: 12374
dir: C:   #dirs: 67183 #depth: 13 #files: 259099 #nodes: 326282  1/s: 5558 (skipping 'C:/Users')

The number of files found in C:/Windows is 193 less than shown by MS File Explorer, and the number of directories is 36 less.

One of the examples in the Win32::LongPath documentation contains:

# recurse if dir
  if (($file ne '.') && (($stat->{attribs}
    & (FILE_ATTRIBUTE_DIRECTORY | FILE_ATTRIBUTE_REPARSE_POINT))
    == FILE_ATTRIBUTE_DIRECTORY)) {
    search_tree ($name);
    next;
  }

What does $stat->{mode} & S_IFDIR correspond to?
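
To make the comparison concrete, here is a minimal sketch of the two bit tests side by side. It is pure Perl and does not call Win32::LongPath. The values 0x10 and 0x400 are the Windows API values of the two attribute flags, defined locally under shortened names, and is_plain_dir and is_dir_mode are hypothetical helper names. Whether lstatL sets the directory bit in mode for a junction is something this sketch cannot show; that would have to be checked against the module itself.

```perl
use strict;
use warnings;
use Fcntl ':mode';    # S_IFDIR, S_IFREG, S_ISDIR (core module)

# Windows API attribute values, defined locally for illustration only.
# Shortened names avoid clashing with constants a module might export.
use constant {
    ATTR_DIRECTORY     => 0x0010,    # FILE_ATTRIBUTE_DIRECTORY
    ATTR_REPARSE_POINT => 0x0400,    # FILE_ATTRIBUTE_REPARSE_POINT
};

# Hypothetical helper: true only for a directory that is NOT a
# reparse point (junction or symlink), as in the documentation example.
sub is_plain_dir {
    my $attribs = shift;
    my $mask    = ATTR_DIRECTORY | ATTR_REPARSE_POINT;
    return ( ( $attribs & $mask ) == ATTR_DIRECTORY ) ? 1 : 0;
}

# Hypothetical helper: the mode-based test, which only looks at the
# file type bits and knows nothing about reparse points.
sub is_dir_mode {
    my $mode = shift;
    return S_ISDIR($mode) ? 1 : 0;
}

printf "plain dir: %d, junction: %d, file: %d\n",
  is_plain_dir(0x0010), is_plain_dir(0x0410), is_plain_dir(0x0020);
printf "mode dir: %d, mode file: %d\n",
  is_dir_mode( S_IFDIR | 0755 ), is_dir_mode( S_IFREG | 0644 );
```

The attribs test rejects an entry as soon as the reparse-point bit is set, even if the directory bit is set as well. A test on the mode bits alone has no such filter.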

Here is my test script:

use strict;
use warnings;
use 5.010;

use Path::Tiny qw( path );
use Data::Dump qw(dump dd ddx);

use Win32::LongPath;
use File::stat;
use Fcntl ':mode';
use Benchmark qw(:all);
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";

my @dir_skip = (
    '$RECYCLE.BIN', 'System Volume Information', 'Config.Msi', 'C:/Users'
    #, 'C:/AMD', 'C:/hp'
);
my $dir_skip        = join '|', map { quotemeta } @dir_skip;
my $dir_skip_regexp = qr {$dir_skip$};

sub do_dir {
    my $dir_path = shift;
    my $sub_ref  = shift;    # callback

    my $dir = Win32::LongPath->new;
    unless ( $dir->opendirL($dir_path) ) {
        warn "!! unable to open $dir_path ($^E)";
        return [];    # empty list, so the caller can still dereference it
    }
    my @dir_name;

    while ( my $name = $dir->readdirL() ) {

        if ( $name =~ m{ ^[.]{1,2}$ }x ) { next; }
        my $path = "$dir_path/$name";

        my $stat = lstatL($path);

        if ( !defined $stat ) {
            next if $^E =~ /Åtkomst nekad/;    # Swedish for "Access denied"
            warn "!! SKIP $^E <$path>";
            next;
        }
        if ( $stat->{mode} & S_IFREG ) {    # normal file
            $sub_ref->( $path, $stat );     # call callback
        }
        elsif ( $stat->{mode} & S_IFDIR ) {    # dir
            push @dir_name, $name;
        }
        else {
            warn "!! ? $name";
        }
    }
    return \@dir_name;
}

{
    my @to_do;
    my $max_depth;

    sub td_to_txt { dump @to_do; }

    sub td_clear { @to_do = (); $max_depth = 0; }

    sub td_down {
        push @to_do, [];
        my $depth = @to_do;
        $max_depth = $depth if $depth > $max_depth;
    }

    sub max_depth { return $max_depth }

    sub td_add {
        my $name = shift;
        @to_do = [] unless (@to_do);
        push @{ $to_do[-1] }, $name;
    }

    sub td_add_aref {
        my $dir_aref = shift;
        push @{ $to_do[-1] }, @$dir_aref;
    }

    sub td_path_next {
        return join '/', map { $_->[0] } @to_do;
    }

    sub td_remove_dir {
        my $aref    = $to_do[-1];
        my $removed = shift @$aref;    # remove dir
        return if @$aref;              # more dirs

        while ( $aref = $to_do[-1] ) {
            if ( !@$aref ) {
                $removed = pop @to_do;    # remove level
                next;
            }
            $removed = shift @{ $to_do[-1] };    # remove dir
            return if @$aref;                    # more dirs
        }
    }
}

sub FS_sweep {
    my $dir_path = shift;
    my $sub_ref  = shift;
    td_clear;
    td_down;
    td_add($dir_path);
    my $dir_cnt = 0;
    my $t0      = Benchmark->new;

    while ( my $dir_path = td_path_next ) {
        if ( $dir_path =~ m{$dir_skip_regexp} ) {
            warn "SKIPPING DIR $dir_path";
            td_remove_dir;
            next;
        }

        $dir_cnt++;
        my $dir_name_aref = do_dir( $dir_path, $sub_ref );

        my $sub_dir_nof = @$dir_name_aref;
        if ( $sub_dir_nof > 1000 ) {
            warn "!! MANY SUBDIR  $sub_dir_nof in $dir_path";
        }

        if (@$dir_name_aref) {    # subdir
            td_down;
            td_add_aref($dir_name_aref);
        }
        else {
            warn '!! ! defined $dir_name_aref' if !defined $dir_name_aref;
            td_remove_dir;
        }
    }
    my $td = timediff( Benchmark->new, $t0 );
    return $dir_path, $dir_cnt, max_depth, $td;
}

my @output;
my $file_cnt = 0;

sub file_log {
    my $file_path = shift;
    $file_cnt++;
    warn "!#      $file_cnt $file_path\n" if not $file_cnt % 10000;
}

sub summary {
    my $dir_path        = shift;
    my $dir_cnt         = shift;
    my $max_depth       = shift;
    my $td              = shift;
    my $node_cnt        = $dir_cnt + $file_cnt;
    my $node_per_second = $td->cpu_p > 0 ? $node_cnt / $td->cpu_p : -1;
    my $txt =
      sprintf
"\n\n!! FS_sweep summary dir: %s\n   #dirs: %d #depth: %d #files: %d #nodes: %d  1/s: %d\n",
      $dir_path, $dir_cnt, $max_depth, $file_cnt, $node_cnt, $node_per_second;
    $file_cnt = 0;
    return $txt;
}

my $ls_log = 1;    # activate listing of files in 'ls_log.txt'
my $log_fh;
$log_fh = path('ls_log.txt')->openw_utf8 if $ls_log;

sub FS_file_big {
    my $file_path     = shift;
    my $stat_hash_ref = shift;
    file_log($file_path);
    say {$log_fh} $file_path if $ls_log;
    my $size = $stat_hash_ref->{size};
    push @output, "BIG $file_path size: $size\n" if $size > 100000000;
}

sub output {
    if (@output) {
        say "Output:";
        say map { "$_\n" }
          grep { defined } @output[ 0 .. 100, 1000 .. 1010, 2000 .. 2010 ];
        say "END Output\n";
        @output = ();
    }
    STDOUT->flush;
}

say summary( FS_sweep( 'C:/Windows', \&FS_file_big ) );
output;

say summary( FS_sweep( 'C:', \&FS_file_big ) );
output;

foreach my $dev (qw{ }) { # add C D ...
    warn "!! START $dev: =======================================\n";
    say summary( FS_sweep( "$dev:", \&FS_file_big ) );
    output;
}
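
One detail of the `while ( my $name = $dir->readdirL() )` loop in do_dir can be checked without touching the file system. Perl only adds an implicit defined() test for a few builtins such as readdir, not for method calls, so a plain truth test ends the loop at an entry named "0". The sketch below uses a hypothetical MockDir class in place of a real directory handle, and it assumes that readdirL returns undef at the end of the directory the way the builtin readdir does in scalar context.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a directory handle: readdirL() returns the
# next name from a fixed list, then undef once the list is exhausted.
{
    package MockDir;

    sub new {
        my ( $class, @names ) = @_;
        return bless { names => [@names] }, $class;
    }

    sub readdirL {
        my $self = shift;
        return shift @{ $self->{names} };
    }
}

# Loop written as in the test script: stops at the first false name.
sub names_truth_test {
    my $dir = shift;
    my @seen;
    while ( my $name = $dir->readdirL() ) { push @seen, $name }
    return @seen;
}

# Same loop with an explicit defined() test: stops only at undef.
sub names_defined_test {
    my $dir = shift;
    my @seen;
    while ( defined( my $name = $dir->readdirL() ) ) { push @seen, $name }
    return @seen;
}

my @entries = ( 'a.txt', '0', 'b.txt' );
my @short   = names_truth_test( MockDir->new(@entries) );
my @full    = names_defined_test( MockDir->new(@entries) );

printf "truth test: %d names, defined test: %d names\n",
  scalar @short, scalar @full;
```

With the entry named "0" in the middle of the list, the truth-test loop returns only the names before it, which is one possible reason for a file count lower than the one File Explorer reports.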

In reply to Re^2: Function to sweep a file tree by bojinlund
in thread Function to sweep a file tree by bojinlund



No such thing as a small change
	PerlMonks
