Benchmark.
It may very well be that whatever you come up with is only faster than File::Find on some combinations of disks, volume managers, and filesystems, and slower on others.
You need to benchmark to find out.
Now, in theory, carefully handcrafting something that does exactly what you need is going to be faster than a more general tool like File::Find. But whether that will actually be measurable is a different question.
So, benchmark.
Of course, as you describe the problem, the bottleneck might very well be your I/O.
Hence, benchmark.
Have I said you should benchmark? No? Well, benchmark!
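For instance, here is a rough sketch using the core Benchmark module to pit File::Find against a hand-rolled opendir/readdir walker; the walk() helper, the $root default, and the iteration count are illustrative assumptions, not a recommendation:

use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Find;
use File::Spec;

my $root = shift // '.';    # directory tree to scan (placeholder)

sub walk {                  # hand-rolled recursive descent, names only
    my ($dir) = @_;
    opendir( my $dh, $dir ) or return;
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry eq '.' or $entry eq '..';
        my $path = File::Spec->catfile( $dir, $entry );
        walk($path) if -d $path && !-l $path;   # don't follow symlinks
    }
    closedir $dh;
}

cmpthese( 5, {              # 5 runs each; adjust to your tree size
    file_find => sub { find( sub { }, $root ) },
    handmade  => sub { walk($root) },
} );

Run it against the same tree on each setup you care about; the ranking may well flip between machines.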
use File::Find::Object::Rule;

my $rule = File::Find::Object::Rule->new();
$rule->some_filter_method(@parameters)    # placeholder: see the library's examples for real filters
     ->eventually_next_filter();
$rule->start(@path_or_array_of_paths);    # the iterator is initialized here; don't panic, it will not load the whole big directory structure
while (1) {
    my $item = $rule->match();    # read one single item; I prefer doing it here, it avoids the match being treated as the while-loop break condition
    last unless defined $item;    # stop looping after the last element
    # here do anything with $item; it is a path, for example:
    printf "Fetched [%s]\n", $item;
    print "it is a symbolic link\n" if -l $item;
}
You can leave this loop in any state and, for example, start the next scan by calling $rule->start(@new_paths). The iterator will be reinitialized; for me it works. Of course, in that situation you'll use the same filters as before. If you want to scan with different filters, call File::Find::Object::Rule->new() and your filter methods again.
Warning: this library is a fork of File::Find::Rule (itself built on File::Find), which has been unmaintained for a long time; I found this notice on MetaCPAN.
use File::Find;
It's fast enough, and will do the job for you.
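A minimal sketch of that (the -f test and the starting directory are placeholders for whatever the OP actually needs):

use strict;
use warnings;
use File::Find;

my $top_dir = shift // '.';   # root of the tree to scan (placeholder)

find( sub {
    # $_ is the current entry, $File::Find::name is its full path
    return unless -f;         # skip anything that is not a plain file
    print "$File::Find::name\n";
}, $top_dir );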
I am sure the bottleneck is going to be the disk I/O, so there is little you can do from Perl to speed up the operation.
...well, maybe the order used to traverse the file system (depth- or breadth-first) could have some influence.
I've been a devotee of using the compiled unix/linux "find" utility in preference to File::Find or any of its variants/derivatives, because I found that for any directory tree of significant size, something like this:
open( my $find, '-|', 'find', $path, @args ) or die "can't run find: $!";
while ( my $name = <$find> ) {
    chomp $name;
    ....
}
close $find;
was significantly faster. In fact, just for grins, I tried an old benchmark that I posted here several years ago, to see if the results were still true with reasonably modern versions (perl 5.10.0 on macosx, File::Find 1.12), and I found an order of magnitude difference on a reasonably large tree (~30K files, 90 sec using File::Find, 9 sec using "find".)
But then I ran into a case where someone had created a really obscene quantity of files in a single directory on a FreeBSD file server, and FreeBSD's "find" utility choked. (Apparently, that version of "find" was building some sort of in-memory storage for each directory, and it hit a massive number of page faults on the path in question.)
I reverted to a recursive opendir/readdir approach for that case, and it succeeded reasonably well. Under "normal" conditions, compiled "find" seems to run about 10% faster than recursive opendir/readdir, but in that particular case of an "abnormal" directory, FreeBSD's "find" became effectively unusable, while opendir/readdir performance was consistent with normal conditions.
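For reference, a rough sketch of that kind of recursive scan (the scan_tree() name and the callback are hypothetical, and error handling is deliberately minimal):

use strict;
use warnings;
use File::Spec;

sub scan_tree {
    my ( $dir, $callback ) = @_;
    opendir( my $dh, $dir ) or do { warn "can't opendir $dir: $!"; return };
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry eq '.' or $entry eq '..';
        my $path = File::Spec->catfile( $dir, $entry );
        if ( -d $path && !-l $path ) {    # recurse, but don't follow symlinks
            scan_tree( $path, $callback );
        }
        else {
            $callback->($path);
        }
    }
    closedir $dh;
}

scan_tree( '/some/path', sub { print "$_[0]\n" } );

Because readdir is called one entry at a time, memory use stays proportional to the depth of the tree rather than the width of any single directory, which is exactly what mattered in that "abnormal" case.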
I just posted a utility I wrote for scanning directories, which uses recursive opendir/readdir: Get useful info about a directory tree -- I'm sure it includes a lot of baggage that you don't need, but perhaps it won't be too hard to pick out the useful bits...
Unless the OP needs the entire list of "tens of millions of files" at once, I'd suggest not using File::Find::Rule, and instead using an iterator- or callback-based routine (like File::Find). If you can process files one at a time, there's no need to build such a huge list.
runrig,
I don't follow you. File::Find::Rule does not simply return every file (although it can). You can tell it what to return based on type, size, and name, and even make the determination via a custom sub. You can have the sub perform actions and ignore the final return value, or iterate with the start and match methods, as in the sketch below.
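For example, a sketch of that start/match iteration (the filters shown here are illustrative, not the OP's actual criteria):

use strict;
use warnings;
use File::Find::Rule;

# plain files over 1 MB whose names end in .log (illustrative filters)
my $rule = File::Find::Rule->file()
                           ->size('>1M')
                           ->name('*.log');

$rule->start('/some/path');    # begin a lazy traversal; nothing is slurped
while ( defined( my $path = $rule->match() ) ) {
    print "$path\n";           # handle exactly one file per iteration
}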