greater efficiency required (ls, glob, or readdir?)

jperlq has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: greater efficiency required (ls, glob, or readdir?) by ikegami (Patriarch) on Aug 27, 2008 at 18:01 UTC
In theory, all solutions will boil down to a readdir (the system call), so it's likely to be the fastest. In practice, the speed difference between the different methods is probably minor, so you should use the method you find easiest to use and maintain. If you find that too slow, then come to us. My personal opinion is that running a child process to get a directory listing is a rather silly thing to do. I wouldn't use `ls`. Doubly so for using `cat` for reading a file!!!
Re^2: greater efficiency required (ls, glob, or readdir?) by betterworld (Curate) on Aug 27, 2008 at 18:24 UTC
"My personal opinion is that running a child process to get a directory listing is a rather silly thing to do. I wouldn't use ls. Doubly so for using cat for reading a file!!!" I second that, especially because `$dir` and `$_` will be interpreted by the shell, so you will get problems if a directory name or entry has special characters in it. Even if you don't think that this is important in your case, it's better to make the code more maintainable and re-usable for security-aware scenarios. While you can avoid these problems by using `open my $pipe, '-|', 'ls', $dir`, it's really not worth the trouble; readdir (or IO::Dir) has fewer problems. And for reading the file, use open or File::Slurp.
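A minimal sketch of the two shell-free options mentioned above, assuming a Unix-like system with `ls` on the PATH. The helper names `ls_no_shell` and `ls_readdir`, the temporary directory and the awkward file names are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# List a directory via the list form of a piped open: the arguments are
# handed to exec directly, so no shell ever parses $dir.
sub ls_no_shell {
    my ($dir) = @_;
    open my $pipe, '-|', 'ls', $dir or die "Cannot run ls on '$dir': $!";
    chomp( my @names = <$pipe> );
    close $pipe or die "ls failed on '$dir' (exit status $?)";
    return @names;
}

# The same listing with no child process at all.
sub ls_readdir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open '$dir': $!";
    my @names = grep { !/^\./ } readdir $dh;    # hide dot entries, like plain ls
    closedir $dh;
    return @names;
}

# Hypothetical test data: a directory name that a shell would misparse.
my $dir = tempdir( CLEANUP => 1 ) . '/odd name; echo oops';
mkdir $dir or die "Cannot create '$dir': $!";
for my $name ( 'a b.txt', 'c;d.txt' ) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$dir/$name': $!";
    close $fh;
}

my @names = ls_readdir($dir);
print "$_\n" for sort @names;
```

Both helpers cope with the space and the semicolon in the directory name; the one-argument backtick form would hand that string to a shell.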
Re^3: greater efficiency required (ls, glob, or readdir?) by JavaFan (Canon) on Aug 27, 2008 at 19:37 UTC
Actually, `$dir` and `$_` will only be interpreted by the shell if they contain "funny" characters. If a string passed as argument to qx or one-arg system contains just alphanums, underscores and whitespace, no shell gets involved; perl will call execv directly. But obviously, if you don't know what `$dir` contains, you shouldn't use `ls $dir`, and if you aren't in control of the content of the directory, you shouldn't use `cat $_`. "Even if you don't think that this is important in your case, it's better to make the code more maintainable and re-usable for security-aware scenarios." I cannot agree with that, if only because the two are often not achievable together. More maintainable code usually means simpler code, while code that needs to run in a possibly hostile environment tends to be more complex than code that doesn't. "More maintainable" and "re-usable for security-aware scenarios" are, most of the time, conflicting requirements.
Re^2: greater efficiency required (ls, glob, or readdir?) by JavaFan (Canon) on Aug 27, 2008 at 18:22 UTC
Actually, using ls has some advantages. For instance, the `opendir`/`readdir` solution presented by jwkrahn below will try to open the current and parent directory as if they were files. A plain `ls` will not return any names starting with a dot. The equivalent of ``my @files = `ls $dir`;`` is `my @files; { opendir my $dh, $dir or die; @files = grep {!/^\./} readdir $dh; closedir $dh; }` As for using cat to read a file, it's something I do often. It's simple. It's a short one-liner. Doing it in pure Perl requires several lines, or something cryptic. The cryptic version: `my $info = do {local (@ARGV, $/) = $file; <>};` The seven-line version: `my $info; { open my $fh, "<", $file or die; local $/; $info = <$fh>; close $fh; }`
Re^3: greater efficiency required (ls, glob, or readdir?) by ikegami (Patriarch) on Aug 27, 2008 at 18:33 UTC
"using ls has some advantages." That's why some use `glob`. "Doing it in pure Perl requires several lines." It would take more than 7 lines to do the equivalent of `or die $!` when using `cat`. It's so complex you probably don't even bother doing it. The OP's code is the perfect example. By using `cat`, he used three lines instead of two, he removed the error checking he'd do with `open`, he introduced a lot of overhead in a loop, he introduced a bug that deletes trailing blank lines, and he introduced a bug for files with spaces and other special characters in their names. Update: Added OP as an example.
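For comparison, a small sketch of reading a whole file in pure Perl with the error check kept. The helper name `slurp` and the temporary file are assumptions made for this illustration; File::Slurp, mentioned earlier in the thread, offers a ready-made equivalent.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the whole file as one string, or die with
# the reason from $! if it cannot be opened.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;                 # undef input separator: read to end of file
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

# Made-up sample data with trailing blank lines.
my ( $out, $path ) = tempfile( UNLINK => 1 );
print {$out} "key\tvalue\n\n\n";
close $out or die "Cannot close '$path': $!";

my $text = slurp($path);
printf "read %d bytes\n", length $text;
```

The content comes back byte for byte, trailing blank lines included, and a missing or unreadable file produces a message that says why.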
Re^4: greater efficiency required (ls, glob, or readdir?) by JavaFan (Canon) on Aug 27, 2008 at 19:23 UTC
Re^5: greater efficiency required (ls, glob, or readdir?) by ikegami (Patriarch) on Aug 27, 2008 at 19:34 UTC
Re^3: greater efficiency required (ls, glob, or readdir?) by shmem (Chancellor) on Aug 27, 2008 at 19:53 UTC
"The equivalent of ... is ..." No, it is not. ls(1) will call getdents(2) until that doesn't return any more directory entries and finally output the list, while perl's readdir returns control after each call to getdents(2). Depending on the size of the directory to be read and the tasks to be done on each entry, it can make a big difference to the distribution of iowait load (which, in sum, will be the same, of course). I'm talking about Linux here.
Re^4: greater efficiency required (ls, glob, or readdir?) by JavaFan (Canon) on Aug 27, 2008 at 20:34 UTC
Re^5: greater efficiency required (ls, glob, or readdir?) by shmem (Chancellor) on Aug 27, 2008 at 20:55 UTC
Re: greater efficiency required (ls, glob, or readdir?) by broomduster (Priest) on Aug 27, 2008 at 19:28 UTC
If the directory is large and you want to do globbing of file names, then `ls` can be a bad idea. From the shell: `-> ls | wc -l` prints `61498`, and `-> ls *` fails with `bash: /bin/ls: Arg list too long`. You wouldn't actually do this to get the full directory listing, of course, but if the list of `something*.whatever` files that you want is too long, you're still hosed. In that case, you want a readdir / grep combo or glob (which doesn't suffer the same limitation as `ls`).
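A rough sketch of both in-process alternatives, selecting `something*.whatever` files without building a shell command line. The helper names and the sample files are hypothetical. `bsd_glob` from the core File::Glob module is used here instead of plain `glob` only because it does not split its pattern on whitespace; that substitution is my choice, not part of the post above.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# readdir + grep: filter names with a regex inside the Perl process.
sub match_readdir {
    my ( $dir, $re ) = @_;
    opendir my $dh, $dir or die "Cannot open '$dir': $!";
    my @names = grep { $_ =~ $re } readdir $dh;
    closedir $dh;
    return sort @names;
}

# glob: the pattern is also expanded in-process, so the kernel's limit on
# argument list length for a child process does not come into play.
sub match_glob {
    my ( $dir, $pattern ) = @_;
    my @names = map { basename($_) } bsd_glob("$dir/$pattern");
    return sort @names;
}

# Made-up sample files.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(something1.whatever something2.whatever other.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$dir/$name': $!";
    close $fh;
}

my @by_readdir = match_readdir( $dir, qr/^something.*\.whatever\z/ );
my @by_glob    = match_glob( $dir, 'something*.whatever' );
print "readdir+grep: @by_readdir\n";
print "glob:         @by_glob\n";
```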
Re: greater efficiency required (ls, glob, or readdir?) by jwkrahn (Abbot) on Aug 27, 2008 at 18:07 UTC
For efficiency you want something like this: `my $dir = '/Path/to/a/data/directory'; opendir my $DH, $dir or die "Cannot open '$dir' $!"; my %hash; while ( my $file = readdir $DH ) { next if $file =~ /~$/; open my $FH, '<', "$dir/$file" or die "Cannot open '$dir/$file' $!"; while ( my $line = <$FH> ) { next if /^#/ || !/\S/; my ( $key, @values ) = split /\t/; $hash{ $file }{ $key } = \@values; } }` Update: I realised that the while loop is incorrect; it should be: `while ( my $line = <$FH> ) { next if $line =~ /^#/ || $line !~ /\S/; my ( $key, @values ) = split /\t/, $line; $hash{ $file }{ $key } = \@values; }`
Re^2: greater efficiency required (ls, glob, or readdir?) by jperlq (Acolyte) on Aug 27, 2008 at 20:09 UTC
Thanks, I got your version to work with very few changes. `my $dir = '/path/to/data/directory'; my %hash; opendir my $DH, $dir or die "cannot open '$dir' $!"; while (my $file = readdir $DH ) { next if $file =~ /~$/; next if -d $file; open my $FH, "<", "$dir/$file" or die "Cannot open '$dir/$file' $!"; while ( my $line = <$FH> ) { next if /^#/ || !length($line); my ($key, @values ) = split(/\t/, $line); $hash{ $file }{ $key } = \@values; } }` It even seems to work quite a bit faster than the ls/cat combo.
Re^3: greater efficiency required (ls, glob, or readdir?) by jwkrahn (Abbot) on Aug 27, 2008 at 21:03 UTC
You need to change: `next if -d $file;` to `next if -d "$dir/$file";` You need to change: `next if /^#/ || !length($line);` to `next if $line =~ /^#/ || !length($line);`
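Putting the corrections from this sub-thread together, the loop might look roughly like the sketch below. The `load_dir` wrapper, the explicit `defined` test, the `chomp` and the sample files are my additions for illustration. The blank-line test uses the `$line !~ /\S/` form from the earlier update, since a line read from a file still carries its newline and so never has zero length.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical wrapper around the loop discussed in this thread.
sub load_dir {
    my ($dir) = @_;
    my %hash;
    opendir my $dh, $dir or die "Cannot open '$dir': $!";
    while ( defined( my $file = readdir $dh ) ) {
        next if $file =~ /~$/;       # editor backup files
        next if -d "$dir/$file";     # test the full path; also skips '.' and '..'
        open my $fh, '<', "$dir/$file" or die "Cannot open '$dir/$file': $!";
        while ( my $line = <$fh> ) {
            next if $line =~ /^#/ || $line !~ /\S/;    # comments and blank lines
            chomp $line;             # assumption: not in the posted code
            my ( $key, @values ) = split /\t/, $line;
            $hash{$file}{$key} = \@values;
        }
        close $fh;
    }
    closedir $dh;
    return \%hash;
}

# Made-up sample directory: one data file, one backup file, one subdirectory.
my $dir = tempdir( CLEANUP => 1 );
mkdir "$dir/sub" or die "Cannot create '$dir/sub': $!";
my %content = (
    'data.txt'  => "# comment\n\nalpha\t1\t2\nbeta\t3\n",
    'data.txt~' => "stale\t9\n",
);
for my $name ( keys %content ) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$dir/$name': $!";
    print {$fh} $content{$name};
    close $fh or die "Cannot close '$dir/$name': $!";
}

my $hash = load_dir($dir);
print join( ', ', sort keys %$hash ), "\n";
```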
Re^2: greater efficiency required (ls, glob, or readdir?) by linuxer (Curate) on Aug 27, 2008 at 19:09 UTC
What about skipping the directory entries? `my $dir = '/Path/to/a/data/directory'; opendir my $DH, $dir or die "Cannot open '$dir' $!"; my %hash; while ( my $file = readdir $DH ) { next if $file =~ /~$/; next if -d "$dir/$file"; ... }` The `-d` test should also skip the '.' and '..' entries; the `...` stands for reading and processing the file. Update: fixed path issue, `"$dir/$file"`.
Re: greater efficiency required (ls, glob, or readdir?) by kyle (Abbot) on Aug 27, 2008 at 19:20 UTC
For fun, I did a rewrite. I haven't tested it, but it compiles cleanly. Read more... (1280 Bytes) As to your question, I think a grep on readdir would be the best way to go to get your list of files. The files themselves, you could process line-by-line instead of reading every line at once, and that might be better. You might be able to get the shell to do even more of the work for you, though. `my $dir = "/Path/to/a/data/directory"; my %hash; open my $grep_fh, '-|', "grep '^' $dir/* /dev/null" or die "Can't grep: $!\n"; while ( my $line = <$grep_fh> ) { $line =~ s/^([^:]+)://; my $file = $1; next if $file =~ /\~$/; %{$hash{$file}} = (%{$hash{$file}}, %{hashify_line( $line )}); } close $grep_fh or die "Error closing grep pipe: $!\n";` This way, you get grep and the shell to do all the I/O. Notes: I hope you don't have any filenames with a colon in them. This uses sub `hashify_line` as defined in the <readmore> above. (Hey, refactoring pays off!) We assume also that `$dir` does not contain any shell metacharacters. If yours isn't really a literal as in your example, you may have to sanitize it. You could probably get grep to do some of your line filtering for you, but I'd just as soon do that in Perl. Likewise, you could use find and xargs to choose the file list to pass to grep, and I'd really rather do that in Perl. Both of those "rather do that in Perl" statements may need to be reevaluated in light of performance problems. (For example, if you waste a lot of time ignoring lines.) Update: broomduster makes a good point in Re: greater efficiency required (ls, glob, or readdir?) also. If you have too many files, the "`$dir/*`" to the shell is going to bomb. Time for xargs, then. Something like: `my $cmd = "find $dir -type f -print | xargs grep '^' /dev/null"; open my $grep_fh, '-|', $cmd or die "Can't grep: $!\n";` Then you may have to filter out dot files somewhere.
Re: greater efficiency required (ls, glob, or readdir?) by superfrink (Curate) on Aug 28, 2008 at 04:48 UTC
`ls` sorts the list of files by default. If you don't care about the order of the files you can shave off a bit of time. `ls -f` will turn off sorting. The difference you see will depend on your filesystem and you will have to try it out to see how much faster it is. In my experience there is a small difference with ReiserFS but there is a noticeable difference with Solaris UFS.
Re^2: greater efficiency required (ls, glob, or readdir?) by Hue-Bond (Priest) on Aug 28, 2008 at 05:51 UTC
"`ls -f` will turn off sorting." The man page for `ls` on my Linux system states that `ls -f` is equivalent to enabling `-a` (among other things). This would change the behaviour compared with plain `ls`, so I would recommend using `ls -U` instead. -- David Serrano
Re: greater efficiency required (ls, glob, or readdir?) by jethro (Monsignor) on Aug 27, 2008 at 18:19 UTC
`|| !$_` is possibly unsafe: you would ignore a file named '0'. Also, 'or' is usually better than '||' because of its lower precedence (it doesn't make a difference here, though).
Re^2: greater efficiency required (ls, glob, or readdir?) by ikegami (Patriarch) on Aug 27, 2008 at 18:48 UTC
`|| !length($_)` (Upd: or better yet `|| !/\S/` ) would be the solution. `||` is fine. It's being used to join expressions, not statements.
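To see how the three tests differ, here is a small sketch that applies each of them to a few sample strings. The hash `%is_blank`, its key names and the sample inputs are invented for this illustration.

```perl
use strict;
use warnings;

# Three candidate "is this value empty?" tests from the discussion above.
my %is_blank = (
    negate => sub { !$_[0] },            # plain truth test
    len    => sub { !length $_[0] },     # zero-length test
    re     => sub { $_[0] !~ /\S/ },     # nothing but whitespace
);

for my $value ( '0', '', '   ', 'a' ) {
    my @hits = grep { $is_blank{$_}->($value) } sort keys %is_blank;
    my $hits = @hits ? join( ', ', @hits ) : 'none';
    printf "%-6s blank according to: %s\n", "'$value'", $hits;
}
```

Only the plain truth test treats the string '0' as empty, which is the hazard pointed out above; only the regex test also catches whitespace-only input.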

