http://qs321.pair.com?node_id=1148448

glasswalk3r has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm looking for a fast way to list the contents of a directory (with thousands of files) on Linux by using Perl.

I did some research on that and found sample C code that uses the getdents system call. By using it, one can avoid calling stat on each file inside the directory (which is basically what the ls command does).

I did some tests with readdir, but its performance compared to the already mentioned C code is not as good. That said, I'm inclined to try Perl's syscall to do the same. Below is the C code (for those inclined to read it):

#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

struct linux_dirent {
    long           d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE 1024*1024*5

int main(int argc, char *argv[])
{
    int fd, nread;
    char buf[BUF_SIZE];
    struct linux_dirent *d;
    int bpos;
    char d_type;

    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        handle_error("open");

    for ( ; ; ) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            handle_error("getdents");
        if (nread == 0)
            break;

        for (bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            if (d->d_ino != 0)
                printf("%s\n", (char *) d->d_name);
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}

This is what the C struct should look like:

struct linux_dirent {
    unsigned long  d_ino;     /* Inode number 32 */
    unsigned long  d_off;     /* Offset to next linux_dirent 32 */
    unsigned short d_reclen;  /* Length of this linux_dirent 16 */
    char           d_name[];  /* Filename (null-terminated); length is
                                 actually (d_reclen - 2 -
                                 offsetof(struct linux_dirent, d_name)) */
};

Since I'm not a C programmer, I'm struggling to achieve that. I found that I need to use unpack to retrieve the information from the related C struct, but I'm lost about:

Is it even possible to do that without having to use XS (or any of its alternatives)? I found Convert::Binary::C to give a hand, but probably I'm not using it correctly due to the 2 issues above. If I use Data::Dumper on the buffer, I can see the file names, but I got only garbage from Convert::Binary::C.

Here is my (not working) Perl code implementation:

#!/usr/bin/env perl
use warnings;
use strict;
use Cwd;
use File::Spec;
use Data::Dumper;
use Fcntl;
use Convert::Binary::C;
use constant BUF_SIZE => 4096;
use lib '/home/myself/perl5/perls/perl-5.20.1/lib/site_perl/5.20.3/i686-linux/sys';
require 'syscall.ph';

my $dir = File::Spec->catdir( getcwd(), 'test' );
sysopen( my $fd, $dir, O_RDONLY | O_DIRECTORY );
my $buf = "\0" x 128;
$! = 0;
my $converter = Convert::Binary::C->new();
my $struct = <<CODE;
struct foo {
    long d_ino;
    unsigned long d_off;
    unsigned short d_reclen;
    char d_name[];
};
CODE
$converter->parse($struct);
my $read = syscall( &SYS_getdents, fileno($fd), $buf, BUF_SIZE );

if ( ( $read == -1 ) and ( $! != 0 ) ) {
    die "failed to syscall getdents: $!";
}

#print Dumper($read), "\n";
#print Dumper($buf), "\n";
close($fd);
my $data = $converter->unpack( 'foo', $buf );
print Dumper($data);

Thanks!

UPDATED

For the sake of others who may want to research this, I made the module Linux::NFS::BigDir available on CPAN, and here is the complete working code I got after all the input from andal:

#!/usr/bin/env perl
use warnings;
use strict;
use File::Spec;
use Getopt::Std;
use Fcntl;
use constant BUF_SIZE => 4096;
use lib '/home/myself/perl5/perls/perl-5.20.1/lib/site_perl/5.20.3/i686-linux/sys';
require 'syscall.ph';

my %opts;
getopts( 'd:', \%opts );
die 'option -d <DIRECTORY> is required'
  unless ( ( exists( $opts{d} ) ) and ( defined( $opts{d} ) ) );
sysopen( my $fd, $opts{d}, O_RDONLY | O_DIRECTORY );

while (1) {
    my $buf = "\0" x BUF_SIZE;
    my $read = syscall( &SYS_getdents, fileno($fd), $buf, BUF_SIZE );

    if ( ( $read == -1 ) and ( $! != 0 ) ) {
        die "failed to syscall getdents: $!";
    }

    last if ( $read == 0 );

    while ( $read != 0 ) {
        my ( $ino, $off, $len, $name ) = unpack( "LLSZ*", $buf );

        #print $name, "\n" if ( $ino != 0 );
        unless ( ( $name eq '.' ) or ( $name eq '..' ) ) {
            my $path = File::Spec->catfile( $opts{d}, $name );
            unlink $path or die "Cannot remove $path: $!";
        }
        substr( $buf, 0, $len ) = '';
        $read -= $len;
    }
}
close($fd);
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

Replies are listed 'Best First'.
Re: using Linux getdents syscall
by andal (Hermit) on Nov 24, 2015 at 09:04 UTC

    I'm not going to discuss what is more efficient, syscall or readdir. It is up to you to check and decide. I'll just try to give you some hints on the use of pack/unpack and/or Convert::Binary::C.

    Your first mistake is the difference between the actual size of the buffer and its declared size. You do my $buf = "\0" x 128; which reserves 128 bytes, but you tell the syscall that your buffer is BUF_SIZE long, which is 4096. You'll end up with corrupted memory, or even a SEGFAULT.

    Then, you have to check for syscall returning 0. In this case, there's nothing to unpack.

    Finally, the returned buffer will contain one or more records of variable length. You have to configure Convert::Binary::C so that it is prepared to handle such records; read "The Dimension Tag" section in the module documentation. Your algorithm should be: while $read != 0, convert the first structure; subtract the length of the converted record from $read; remove the processed part from the buffer.

    You don't have to use Convert::Binary::C. Using perls native "unpack" you can process this structure like this

    while ($read != 0) {
        my ($ino, $off, $len, $name) = unpack("LLSZ*", $buf);
        print $name, "\n" if ($ino != 0);
        substr($buf, 0, $len) = '';
        $read -= $len;
    }

      Thanks for the tips!

      Considering that $buf might have multiple dentries inside of it, how do I navigate through all those records? In C it's easy because I'm accessing memory directly, but how do I know how many bytes I should read from $buf, and how do I move to the next record in Perl?


        I've already put it in the code. The variable $len contains the total length of a single dirent structure, so all you have to do is remove that many bytes from the beginning of the buffer.

      For unsigned long, you should use L!, not L.

      For unsigned short, you should use S!, not S.

      Ref

Re: using Linux getdents syscall
by afoken (Chancellor) on Nov 24, 2015 at 06:58 UTC

    Quoting the man page of getdents:

    These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface.

    That's all that you need to know about getdents(2). Perl has a readdir function that calls readdir(3) internally, and I'm quite sure it is optimized. Readdir(3) itself is most likely implemented in the libc as calling getdents(2), with a fallback to readdir(2) for older kernels.

    I'm looking for a fast way to list the contents of a directory (with thousands of files) on Linux by using Perl.

    opendir, readdir, closedir. Benchmark that. Compare with ls. Most likely, you won't get faster than that, simply because perl has higher startup costs and does not run native code, but instead follows a complex data structure representing your perl script.

    My guess is that the bottleneck is the disk and its interface, not the actual functions called to read the directory. Sure, libc and perl add some overhead, but not that much.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      I don't care about portability or being POSIX-conformant... I need to erase lots of files as fast (and with as little overhead) as possible on Linux.

      readdir works fine, but once the directory has thousands of files, performance starts to degrade.

      It just occurred to me right now that I could check whether readdir does a stat system call on each file inside the directory... that would explain why the Perl code to clean the directory is slower. But it wouldn't help me solve this issue anyway.


        There has already been lots of great input on this topic. I'll add that there is no faster way on a *nix box to delete a large number of files from a directory than using xargs.

        Either with: ls [SOME MASK] | xargs rm

        or: ls | grep [SOME MASK] | xargs rm

        --
        “For the Present is the point at which time touches eternity.” - CS Lewis
Re: using Linux getdents syscall
by choroba (Cardinal) on Nov 24, 2015 at 14:18 UTC
Re: using Linux getdents syscall
by Anonymous Monk on Nov 24, 2015 at 01:21 UTC
    The simplest thing to do is to compile the C program and run it from your Perl program. Use qx(), or open my $pipe, '-|', '/your/c/program', and just read its output. Unless I'm missing something, I don't see any reason to use Perl's syscall, unpack structs, allocate buffers and do other complicated things.

      So, you think that starting an external process just to list directories has LESS overhead than opendir/readdir/closedir? Sorry, but that is nonsense! See Re: using Linux getdents syscall

      Alexander


      I would rather remove the files from inside the C code and avoid doing another system call from Perl... but there is no fun in doing that.

Re: using Linux getdents syscall
by oiskuu (Hermit) on Nov 24, 2015 at 17:41 UTC

    The performance problems you're encountering are almost certainly due to the mechanics of the underlying filesystem, or its metadata handling to be more precise.

    Choosing the right filesystem and the right options can have tremendous impact in some cases, but be sure to understand the implications. The Linux-native ext4 with default options is a good all-around choice. For temporary files, tmpfs is an excellent candidate.

    If you must use ext4 filesystem for backing lots and lots of transient files, however, do consider mounting it without journal (-o noload). Or the opposite—perhaps a generously sized journal might save the day. In any case, taking a step back to re-evaluate your optimization approaches would be in order.

      You're right oiskuu: I tested the code in a different situation and didn't get better performance compared to other solutions (rm -f * was the best option if all files were to be removed). On the other hand, it did show better performance than using readdir().

      I believed that retrieving all the file names at once, before applying any checks or removing them, would be faster, but I still need to benchmark it.

      This is going a bit outside the scope of Perl, but I guess that, after all, the only way to detect this kind of issue with the filesystem is to have a baseline to compare against.
