Hello monks,
I'm looking for a fast way to list the contents of a directory (with thousands of files) on Linux by using Perl.
I did some research on that and found a sample C code that uses the getdents system call for that. By using it, one can avoid calling stat on each file inside the directory (basically what ls command does).
I did some tests with readdir, but performance speed compared to the already mentioned C code is not as good. That said, I'm inclined to try to use Perl syscall to do the same. Below is the C code (for those inclined to read it):
#define _GNU_SOURCE
#include <dirent.h> /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
struct linux_dirent {
long d_ino;
off_t d_off;
unsigned short d_reclen;
char d_name[];
};
#define BUF_SIZE 1024*1024*5
int
main(int argc, char *argv[])
{
int fd, nread;
char buf[BUF_SIZE];
struct linux_dirent *d;
int bpos;
char d_type;
fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
if (fd == -1)
handle_error("open");
for ( ; ; ) {
nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
if (nread == -1)
handle_error("getdents");
if (nread == 0)
break;
for (bpos = 0; bpos < nread;) {
d = (struct linux_dirent *) (buf + bpos);
if (d->d_ino != 0) printf("%s\n", (char *) d->d_name);
bpos += d->d_reclen;
}
}
exit(EXIT_SUCCESS);
}
This is how the C struct should look like:
struct linux_dirent {
unsigned long d_ino; /* Inode number 32*/
unsigned long d_off; /* Offset to next linux_dirent 32*/
unsigned short d_reclen; /* Length of this linux_dirent 16*/
char d_name[]; /* Filename (null-terminated) */
/* length is actually (d_reclen - 2 -
offsetof(struct linux_dirent, d_name)) */
}
Since I'm not a C programmer, I struggling to achieve that. I found that I need to use unpack to retrieve the information from the related C struct, but I'm lost about:
- Finding out the lenght I need to setup the Perl equivalent to the buffer (a scalar set with NUL characters, as my $buffer = "\0" x 64;), specially because the related C structure has a char array with dynamic length
- The buffer will retain a N number of dentries inside of it. How can I find the exactly number of bytes each dentrie has and how can I jump from one entry to the other with Perl?
Is it even possible to do that without having to use XS (or any of it's alternatives)? I found Convert::Binary::C to give a hand, but probably I'm not using it correctly due the 2 issues above. If I use Data::Dumper on the buffer, I can see the file names, but got only garbage from Convert::Binary::C.
Here is my (not working) Perl code implementation:
#!/usr/bin/env perl
use warnings;
use strict;
use Cwd;
use File::Spec;
use Data::Dumper;
use Fcntl;
use Convert::Binary::C;
use constant BUF_SIZE => 4096;
use lib
'/home/myself/perl5/perls/perl-5.20.1/lib/site_perl/5.20.3/i686-linux/
+sys';
require 'syscall.ph';
my $dir = File::Spec->catdir( getcwd(), 'test' );
sysopen( my $fd, $dir, O_RDONLY | O_DIRECTORY );
my $buf = "\0" x 128;
$! = 0;
my $converter = Convert::Binary::C->new();
my $struct = <<CODE;
struct foo {
long d_ino;
unsigned long d_off;
unsigned short d_reclen;
char d_name[];
};
CODE
$converter->parse($struct);
my $read = syscall( &SYS_getdents, fileno($fd), $buf, BUF_SIZE );
if ( ( $read == -1 ) and ( $! != 0 ) ) {
die "failed to syscal getdents: $!";
}
#print Dumper($read), "\n";
#print Dumper($buf), "\n";
close($fd);
my $data = $converter->unpack( 'foo', $buf );
print Dumper($data);
Thanks!
UPDATED
For the sake of others that may want to research about, I made available the module Linux::NFS::BigDir for that at CPAN, and here is the complete working code I got after all inputs from andal:
#!/usr/bin/env perl
use warnings;
use strict;
use File::Spec;
use Getopt::Std;
use Fcntl;
use constant BUF_SIZE => 4096;
use lib
'/home/myself/perl5/perls/perl-5.20.1/lib/site_perl/5.20.3/i686-linux/
+sys';
require 'syscall.ph';
my %opts;
getopts( 'd:', \%opts );
die 'option -d <DIRECTORY> is required'
unless ( ( exists( $opts{d} ) ) and ( defined( $opts{d} ) ) );
sysopen( my $fd, $opts{d}, O_RDONLY | O_DIRECTORY );
while (1) {
my $buf = "\0" x BUF_SIZE;
my $read = syscall( &SYS_getdents, fileno($fd), $buf, BUF_SIZE );
if ( ( $read == -1 ) and ( $! != 0 ) ) {
die "failed to syscal getdents: $!";
}
last if ( $read == 0 );
while ( $read != 0 ) {
my ( $ino, $off, $len, $name ) = unpack( "LLSZ*", $buf );
#print $name, "\n" if ( $ino != 0 );
unless ( ( $name eq '.' ) or ( $name eq '..' ) ) {
my $path = File::Spec->catfile( $opts{d}, $name );
unlink $path or die "Cannot remove $path: $!";
}
substr( $buf, 0, $len ) = '';
$read -= $len;
}
}
close($fd);
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.