Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Re^2: Function to sweep a file tree

by bojinlund (Prior)
on Jun 21, 2020 at 06:21 UTC ( #11118300=note: print w/replies, xml ) Need Help??

in reply to Re: Function to sweep a file tree
in thread Function to sweep a file tree

Thanks vr ! It has really helped me a lot!

I've re-written your code completely ...

OK, Good!

1) I rewrote my FS_sweep based on your proposal

my $d = Win32::LongPath-> new; $d-> readdirL;
Doing stress tests using C:, I got a number of problems. The script sometimes works. But often it stucks, loops and was difficult to kill. Had to use the ctrl-alt-del/activity handler to stop it. I think that the memory is overwrite by  $d-> readdirL;. The largest directory I have found returned by  $d-> readdirL; has 38252 entries. MS File Explorer says: 38250 objekt

2) I rewrote my FS_sweep using while ( my $name = $dir->readdirL() )

Below follows a script which can be used to test this approach.

This works much better. But there are still problems. The script is sometimes stuck (no cpu time is used) or looping (cpu time is used, but nothing is happening). When logging the found file pathes to a file, the frequency of the problem seem to be lower. There is probably some type of timing problem in  readdirL. By accessing C:/Users the problem is rather frequent. Sweeping less complicated file structure as C: seem to be OK!

File path like <C:/Users/bo/Application Data/> are sometimes returned by readdirL()!?

Here are some results

dir: C: #dirs: 79923 #files: 305816 #nodes: 385739 1/s: 5749 dir: D: #dirs: 7776 #files: 115255 #nodes: 123031 1/s: 14499 dir: Q: #dirs: 907 #files: 16095 #nodes: 17002 1/s: 12374 dir: C: #dirs: 67183 #depth: 13 #files: 259099 #nodes: 326282 1/s: +5558 (skipping 'C:/Users')

The in C:/Windows found number of files are 193 less than shown by the MS File Explorer and for directories 36 less.

In the documentation of Win32::LongPath there is in one of the examples

# recurse if dir if (($file ne '.') && (($stat->{attribs} & (FILE_ATTRIBUTE_DIRECTORY | FILE_ATTRIBUTE_REPARSE_POINT)) == FILE_ATTRIBUTE_DIRECTORY)) { search_tree ($name); next; }
What does $stat->{mode} & S_IFDIR correspond to?

Here is my test script:

use strict; use warnings; use 5.010; use Path::Tiny qw( path ); use Data::Dump qw(dump dd ddx); use Win32::LongPath; use File::stat; use Fcntl ':mode'; use Benchmark qw(:all); binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my @dir_skip = ( '$RECYCLE.BIN', 'System Volume Information', 'Config.Msi', 'C:/Use +rs' #, 'C:/AMD', 'C:/hp' ); my $dir_skip = join '|', map { quotemeta } @dir_skip; my $dir_skip_regexp = qr {$dir_skip$}; sub do_dir { my $dir_path = shift; my $sub_ref = shift; # callback my $dir = Win32::LongPath->new; unless ( $dir->opendirL($dir_path) ) { warn "!! unable to open $dir_path ($^E)"; return; } my @dir_name; while ( my $name = $dir->readdirL() ) { if ( $name =~ m{ ^[.]{1,2}$ }x ) { next; } my $path = "$dir_path/$name"; my $stat = lstatL($path); if ( !defined $stat ) { next if $^E =~ /tkomst nekad/; warn "!! SKIP $^E <$path>"; next; } if ( $stat->{mode} & S_IFREG ) { # normal file $sub_ref->( $path, $stat ); # call callback } elsif ( $stat->{mode} & S_IFDIR ) { # dir push @dir_name, $name; } else { warn "!! ? $name"; } } return \@dir_name; } { my @to_do; my $max_depth; sub td_to_txt { dump @to_do; } sub td_clear { @to_do = (); $max_depth = 0; } sub td_down { push @to_do, []; my $depth = @to_do; $max_depth = $depth if $depth > $max_depth; } sub max_depth { return $max_depth } sub td_add { my $name = shift; @to_do = [] unless (@to_do); push @{ $to_do[-1] }, $name; } sub td_add_aref { my $dir_aref = shift; push @{ $to_do[-1] }, @$dir_aref; } sub td_path_next { return join '/', map { $_->[0] } @to_do; } sub td_remove_dir { my $aref = $to_do[-1]; my $removed = shift @$aref; # remove dir return if @$aref; # more dirs while ( $aref = $to_do[-1] ) { if ( !@$aref ) { $removed = pop @to_do; # remove level next; } $removed = shift @{ $to_do[-1] }; # remove dir return if @$aref; # more dirs } } } sub FS_sweep { my $dir_path = shift; my $sub_ref = shift; td_clear; td_down; td_add($dir_path); my $dir_cnt = 0; my $t0 = Benchmark->new; while ( my $dir_path = td_path_next ) { if ( $dir_path =~ m{$dir_skip_regexp} ) { warn "SKIPING DIR $dir_path"; td_remove_dir; next; } $dir_cnt++; my $dir_name_aref = do_dir( $dir_path, $sub_ref ); my $sub_dir_nof = @$dir_name_aref; if ( $sub_dir_nof > 1000 ) { warn "!! MANY SUBDIR $sub_dir_nof in $dir_path"; } if (@$dir_name_aref) { # subdir td_down; td_add_aref($dir_name_aref); } else { warn '!! ! defined $dir_name_aref' if !defined $dir_name_a +ref; td_remove_dir; } } my $td = timediff( Benchmark->new, $t0 ); return $dir_path, $dir_cnt, max_depth, $td; } my @output; my $file_cnt = 0; sub file_log { my $file_path = shift; $file_cnt++; warn "!# $file_cnt $file_path\n" if not $file_cnt % 10000; } sub summary { my $dir_path = shift; my $dir_cnt = shift; my $max_depth = shift; my $td = shift; my $node_cnt = $dir_cnt + $file_cnt; my $node_per_second = $td->cpu_p > 0 ? $node_cnt / $td->cpu_p : -1 +; my $txt = sprintf "\n\n!! FS_sweep summary dir: %s\n #dirs: %d #depth: %d #files: %d # +nodes: %d 1/s: %d\n", $dir_path, $dir_cnt, $max_depth, $file_cnt, $node_cnt, $node_per +_second; $file_cnt = 0; return $txt; } my $ls_log = 1; # activate listing of files in 'ls_log.txt' my $log_fh; $log_fh = path('ls_log.txt')->openw_utf8 if $ls_log; sub FS_file_big { my $file_path = shift; my $stat_hash_ref = shift; file_log($file_path); say {$log_fh} $file_path if $ls_log; my $size = $stat_hash_ref->{size}; push @output, "BIG $file_path size: $size\n" if $size > 100000000; } sub output { if (@output) { say "Output:"; say map { "$_\n" } grep { defined } @output[ 0 .. 100, 1000 .. 1010, 2000 .. 20 +10 ]; say "END Output\n"; @output = (); } STDOUT->flush; } say summary( FS_sweep( 'C:/Windows', \&FS_file_big ) ); output; say summary( FS_sweep( 'C:', \&FS_file_big ) ); output; foreach my $dev (qw{ }) { # add C D ... warn "!! START $dev: =======================================\n"; say summary( FS_sweep( "$dev:", \&FS_file_big ) ); output; }

Replies are listed 'Best First'.
Re^3: Function to sweep a file tree
by vr (Curate) on Jun 21, 2020 at 11:42 UTC

    Hi, glad to hear your script was improved. To your questions:

    (1) Re: S_IFDIR. Easy to check: as I see there are 2 entries/directories in my C:/Users tree with FILE_ATTRIBUTE_REPARSE_POINT bit set; for both S_IFDIR is also set. So, to strictly follow Win32::LongPath documentation as to what to treat as "directory" i.e. exclude such entries, -- use fragment you quoted from documentation. Stress tests are good, but my impression was your tree to monitor should be known beforehand, whether it can or cannot contain reparse points. And, I meant the same thing (about "controlled environment"), when I said built-in readdir might be enough vs. readdirL, if a tree is known to be grown by native speakers/users of single (system) Windows code page. For versatility, sure, Win32::LongPath should be preferred.

    (2) Re: <C:/Users/bo/Application Data/> - I don't understand (was this a question/problem?)

    About freezes: that's not good. I can repeatedly sweep C:/Users and C:/Windows without issues, both with my and your scripts. At first, I suspected iterative readdirL (furthermore interspersed with lstatL calls) might be the reason, but, no, your script runs OK here. Still, to debug, maybe try to switch from iterative use to list context call. BTW, I don't observe any noticeable speed difference.

    Lastly, about different total results of (various) Perl techniques and/or what Explorer reports: if you really want to pursue to the core, there's dichotomy with extensive logging, but of course you know the method.

      About freezes:

      I have made a script with
      foreach my $done ( 1 .. 100 ) { FS_sweep( 'C:', \&FS_file_big ); ...

      Running this script several times in Emacs, using the compile command, all resulted in a freeze. The the number of calls to FS_sweep before freeze was between 0 and 91.

      Running the script in the command shell resulted in 0 to 14 succeeded calls to FS_sweep. In most cases none.

      <C:/Users/bo/Application Data/> (was this a question/problem?)

      use strict; use warnings; use 5.010; use Data::Dump qw(dump dd ddx); use Win32::LongPath; use Fcntl ':mode'; sub do_readdirL_arr { my $dir_path = shift; my $dir = Win32::LongPath->new; unless ( $dir->opendirL($dir_path) ) { warn "!! unable to open $dir_path ($^E)"; return; } my @name = $dir->readdirL(); return \@name; } sub do_readdirL_while { my $dir_path = shift; my $dir = Win32::LongPath->new; unless ( $dir->opendirL($dir_path) ) { warn "!! unable to open $dir_path ($^E)"; return; } my @name; while ( my $name = $dir->readdirL() ) { push @name, $name; } return \@name; } my @path = ( 'C:/Documents and Settings', 'C:/ProgramData/Application +Data', 'C:/ProgramData/Desktop', 'C:/ProgramData/Start Menu', 'C:/ProgramData/Templates', #'C:/Users', 'C:/Windows/appcompat/Programs', 'C:/Windows/System32/Com/dmp +', #'C:/Windows/System32/spool', 'C:/Windows/System32/Tasks', 'C:/Windows/System32/Tasks_Migrated', 'C:/Windows/SysWOW64/Com/dmp +', 'C:/Windows/SysWOW64/Tasks', 'C:/Windows/Temp', ); foreach my $path (@path) { say "$path: array: ", dump( do_readdirL_arr($path) ), ' while: ', dump do_readdirL_while($path); } __DATA__ C:/Documents and Settings: array: ["g"] while: ["~"] C:/ProgramData/Application Data: array: ["u"] while: ["{"] C:/ProgramData/Desktop: array: ["H"] while: ["|"] C:/ProgramData/Start Menu: array: ["~"] while: ["m"] C:/ProgramData/Templates: array: ["K"] while: ["e"] C:/Windows/appcompat/Programs: array: ["u"] while: ["Q"] C:/Windows/System32/Com/dmp: array: ["m"] while: ["l"] C:/Windows/System32/Tasks: array: ["\\"] while: ["\x7F"] C:/Windows/System32/Tasks_Migrated: array: ["\22"] while: ["\f"] C:/Windows/SysWOW64/Com/dmp: array: ["p"] while: ["u"] C:/Windows/SysWOW64/Tasks: array: ["["] while: ["p"] C:/Windows/Temp: array: ["\x{105}"] while: ["\x{110}"]

      __DATA__ contains an example of output. It changes for every run.

      I have no access to the directories using the MS File explorer.

      When I in FS_sweep skip those directories, 'C:/Users' and C:/Windows/System32/spool' there are no freezes.

        Now it's getting sad/worrisome/interesting. With my old 5.26 Perl, your script above reports empty arrays as expected (and, as a side note, some reasonable content if run with administrative rights). Also, I finally did mega-stress-test (sweep C:/ for 100 times), it completes OK.

        Then I tried latest recommended Strawberry Perl versions,,, -- under 5.28 and 5.30 results are as you report: random bytes/single characters for "protected operating system" (as MS UI calls them) folders. The "mega-test" (tried once) hung on 5th sweep, though I was able to ctrl-C it. So, definitely looks like a bug (somewhere in Win32::LongPath XS guts?).

        Update. Actually, the bug is not related to "special system" directories, and I'm no longer sure Win32::LongPath under 5.26 behaved correctly, too. Maybe the call to opendirL shouldn't succeed in the first place.

        >perl -v This is perl 5, version 30, subversion 2 (v5.30.2) built for MSWin32-x +64-multi-thread Copyright 1987-2020, Larry Wall Perl may be copied only under the terms of either the Artistic License + or the GNU General Public License, which may be found in the Perl 5 source ki +t. Complete documentation for Perl, including FAQ lists, should be found +on this system using "man perl" or "perldoc perl". If you have access to + the Internet, point your browser at, the Perl Home Pa +ge. >mkdir abc >perl -MData::Dump -MWin32::LongPath -E "for (1..5) {my $lp=Win32::Lon +gPath->new; $lp->opendirL('abc') or die; dd [$lp->readdirL]}" [".", ".."] [".", ".."] [".", ".."] [".", ".."] [".", ".."] >icacls abc /deny Everyone:F processed file: abc Successfully processed 1 files; Failed processing 0 files >perl -MData::Dump -MWin32::LongPath -E "for (1..5) {my $lp=Win32::Lon +gPath->new; $lp->opendirL('abc') or die; dd [$lp->readdirL]}" ["\b"] ["\b"] ["\35"] ["\32"] ["'"] >perl -MData::Dump -MWin32::LongPath -E "for (1..5) {my $lp=Win32::Lon +gPath->new; $lp->opendirL('abc') or die; dd [$lp->readdirL]}" ["\3"] ["\26"] ["#"] ["\22"] [4] >perl -MData::Dump -MWin32::LongPath -E "for (1..5) {my $lp=Win32::Lon +gPath->new; $lp->opendirL('abc') or die; dd [$lp->readdirL]}" ["M"] ["~"] ["M"] ["w"] ["I"] >perl -MData::Dump -MWin32::LongPath -E "for (1..5) {my $lp=Win32::Lon +gPath->new; $lp->opendirL('abc') or die; dd [$lp->readdirL]}" ["M"] ["}"] ["C"] ["l"] ["}"] >icacls abc /remove:d Everyone processed file: abc Successfully processed 1 files; Failed processing 0 files >perl -MData::Dump -MWin32::LongPath -E "for (1..5) {my $lp=Win32::Lon +gPath->new; $lp->opendirL('abc') or die; dd [$lp->readdirL]}" [".", ".."] [".", ".."] [".", ".."] [".", ".."] [".", ".."] >rmdir abc >

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11118300]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (6)
As of 2022-01-27 09:56 GMT
Find Nodes?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:

    Results (70 votes). Check out past polls.