http://qs321.pair.com?node_id=1148448

glasswalk3r has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm looking for a fast way to list the contents of a directory (with thousands of files) on Linux by using Perl.

I did some research on that and found sample C code that uses the getdents system call. By using it, one can avoid calling stat on each file inside the directory (which is basically what the ls command does).

I did some tests with readdir, but its performance compared to the already mentioned C code is not as good. That said, I'm inclined to try Perl's syscall to do the same. Below is the C code (for those inclined to read it):

#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

struct linux_dirent {
    long           d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE 1024*1024*5

int main(int argc, char *argv[])
{
    int fd, nread;
    char buf[BUF_SIZE];
    struct linux_dirent *d;
    int bpos;
    char d_type;

    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        handle_error("open");

    for ( ; ; ) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            handle_error("getdents");
        if (nread == 0)
            break;

        for (bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            if (d->d_ino != 0)
                printf("%s\n", (char *) d->d_name);
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}

This is what the C struct should look like:

struct linux_dirent {
    unsigned long  d_ino;     /* Inode number 32 */
    unsigned long  d_off;     /* Offset to next linux_dirent 32 */
    unsigned short d_reclen;  /* Length of this linux_dirent 16 */
    char           d_name[];  /* Filename (null-terminated); length is
                                 actually (d_reclen - 2 -
                                 offsetof(struct linux_dirent, d_name)) */
};

Since I'm not a C programmer, I'm struggling to achieve that. I found that I need to use unpack to retrieve the information from the related C struct, but I'm lost about:

Is it even possible to do that without having to use XS (or any of its alternatives)? I found Convert::Binary::C to give a hand, but probably I'm not using it correctly due to the 2 issues above. If I use Data::Dumper on the buffer, I can see the file names, but I got only garbage from Convert::Binary::C.

Here is my (not working) Perl code implementation:

#!/usr/bin/env perl
use warnings;
use strict;
use Cwd;
use File::Spec;
use Data::Dumper;
use Fcntl;
use Convert::Binary::C;
use constant BUF_SIZE => 4096;
use lib '/home/myself/perl5/perls/perl-5.20.1/lib/site_perl/5.20.3/i686-linux/sys';
require 'syscall.ph';

my $dir = File::Spec->catdir( getcwd(), 'test' );
sysopen( my $fd, $dir, O_RDONLY | O_DIRECTORY );
my $buf = "\0" x 128;
$! = 0;
my $converter = Convert::Binary::C->new();
my $struct = <<CODE;
struct foo {
    long d_ino;
    unsigned long d_off;
    unsigned short d_reclen;
    char d_name[];
};
CODE
$converter->parse($struct);
my $read = syscall( &SYS_getdents, fileno($fd), $buf, BUF_SIZE );

if ( ( $read == -1 ) and ( $! != 0 ) ) {
    die "failed to syscall getdents: $!";
}

#print Dumper($read), "\n";
#print Dumper($buf), "\n";
close($fd);
my $data = $converter->unpack( 'foo', $buf );
print Dumper($data);

Thanks!

UPDATED

For the sake of others who may want to research this, I made the module Linux::NFS::BigDir available on CPAN, and here is the complete working code I got after all the input from andal:

#!/usr/bin/env perl
use warnings;
use strict;
use File::Spec;
use Getopt::Std;
use Fcntl;
use constant BUF_SIZE => 4096;
use lib '/home/myself/perl5/perls/perl-5.20.1/lib/site_perl/5.20.3/i686-linux/sys';
require 'syscall.ph';

my %opts;
getopts( 'd:', \%opts );
die 'option -d <DIRECTORY> is required'
  unless ( ( exists( $opts{d} ) ) and ( defined( $opts{d} ) ) );
sysopen( my $fd, $opts{d}, O_RDONLY | O_DIRECTORY );

while (1) {
    my $buf = "\0" x BUF_SIZE;
    my $read = syscall( &SYS_getdents, fileno($fd), $buf, BUF_SIZE );

    if ( ( $read == -1 ) and ( $! != 0 ) ) {
        die "failed to syscall getdents: $!";
    }

    last if ( $read == 0 );

    while ( $read != 0 ) {
        my ( $ino, $off, $len, $name ) = unpack( "LLSZ*", $buf );

        #print $name, "\n" if ( $ino != 0 );
        unless ( ( $name eq '.' ) or ( $name eq '..' ) ) {
            my $path = File::Spec->catfile( $opts{d}, $name );
            unlink $path or die "Cannot remove $path: $!";
        }
        substr( $buf, 0, $len ) = '';
        $read -= $len;
    }
}
close($fd);
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

Replies are listed 'Best First'.
Re: using Linux getdents syscall
by andal (Hermit) on Nov 24, 2015 at 09:04 UTC

    I'm not going to discuss what is more efficient, syscall or readdir. It is up to you to check and decide. I'll just try to give you some hints on the use of pack/unpack and/or Convert::Binary::C.

    Your first mistake is the difference between the actual size of the buffer and its declared size. You do my $buf = "\0" x 128; which reserves 128 bytes, but you tell the syscall that your buffer is BUF_SIZE long, which is 4096. You'll end up with corrupted memory, or even a SEGFAULT.

    Then, you have to check for syscall returning 0. In this case, there's nothing to unpack.

    Finally, the returned buffer will contain one or more records of variable length. You have to configure Convert::Binary::C so that it is prepared to handle such records; read "The Dimension Tag" section in the module documentation. Your algorithm should be: while $read != 0, convert the first structure; subtract the length of the converted record from $read; remove the processed part from the buffer.

    You don't have to use Convert::Binary::C. Using perls native "unpack" you can process this structure like this

    while ($read != 0) {
        my ($ino, $off, $len, $name) = unpack("LLSZ*", $buf);
        print $name, "\n" if ($ino != 0);
        substr($buf, 0, $len) = '';
        $read -= $len;
    }

      Thanks for the tips!

      Considering that $buf might have multiple dentries inside of it, how do I navigate through all those records? In C it's easy because I'm accessing memory directly, but how do I know how many bytes I should read from $buf, and how do I move to the next record in Perl?


        I've already put it in the code. The variable $len contains the total length of a single dirent structure, so all you have to do is remove that many bytes from the beginning of the buffer.

      For unsigned long, you should use L!, not L.

      For unsigned short, you should use S!, not S.

      Ref

Re: using Linux getdents syscall
by afoken (Chancellor) on Nov 24, 2015 at 06:58 UTC

    Quoting the man page of getdents:

    These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface.

    That's all that you need to know about getdents(2). Perl has a readdir function that calls readdir(3) internally, and I'm quite sure it is optimized. Readdir(3) itself is most likely implemented in the libc as calling getdents(2), with a fallback to readdir(2) for older kernels.

    I'm looking for a fast way to list the contents of a directory (with thousands of files) on Linux by using Perl.

    opendir, readdir, closedir. Benchmark that. Compare with ls. Most likely, you won't get faster than that, simply because perl has higher startup costs and does not run native code, but instead follows a complex data structure representing your perl script.

    My guess is that the bottleneck is the disk and its interface, not the actual functions called to read the directory. Sure, libc and perl add some overhead, but not that much.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      I don't care about portability or being POSIX-conformant... I need to erase lots of files as fast (and with as little overhead) as possible on Linux.

      readdir works fine, but once the directory has thousands of files, performance starts to degrade.

      It just occurred to me right now that I could check whether readdir does a stat system call on each file inside the directory... that would explain why the Perl code to clean the directory is slower. But it wouldn't help me solve this issue anyway.


        There has already been lots of great input on this topic. I'll add that there is no faster way on a *nix box to delete a large number of files from a directory than using xargs.

        Either with: ls [SOME MASK] | xargs rm

        or: ls | grep [SOME MASK] | xargs rm

        --
        “For the Present is the point at which time touches eternity.” - CS Lewis
Re: using Linux getdents syscall
by choroba (Cardinal) on Nov 24, 2015 at 14:18 UTC
Re: using Linux getdents syscall
by Anonymous Monk on Nov 24, 2015 at 01:21 UTC
    The simplest thing to do is to compile the C program and run it from your Perl program. Use qx(), or open my $pipe, '-|', '/your/c/program', and just read its output. Unless I'm missing something, I don't see any reason to use Perl's syscall, unpack structs, allocate buffers and do other complicated things.

      So, you think that starting an external process just to list directories has LESS overhead than opendir/readdir/closedir? Sorry, but that is nonsense! See Re: using Linux getdents syscall

      Alexander


      I would rather remove the files from inside the C code and avoid doing another system call from Perl... but there is no fun in doing that.

Re: using Linux getdents syscall
by oiskuu (Hermit) on Nov 24, 2015 at 17:41 UTC

    The performance problems you're encountering are almost certainly due to the mechanics of the underlying filesystem, or its metadata handling to be more precise.

    Choosing the right filesystem and the right options can have tremendous impact in some cases, but be sure to understand the implications. The Linux-native ext4 with default options is a good all-around choice. For temporary files, tmpfs is an excellent candidate.

    If you must use ext4 filesystem for backing lots and lots of transient files, however, do consider mounting it without journal (-o noload). Or the opposite—perhaps a generously sized journal might save the day. In any case, taking a step back to re-evaluate your optimization approaches would be in order.

      You're right oiskuu: I tested the code in a different situation and didn't get better performance compared to other solutions (rm -f * was the best option if all files were to be removed). On the other hand, it did show better performance than using readdir().

      I believed that retrieving all the file names at once, before applying any checks or removing them, would be faster, but I still need to benchmark it.

      This is going a bit outside the scope of Perl, but I guess that, after all, the only way to detect this kind of issue with the filesystem is to have a baseline to compare against.
