PerlMonks  

greater efficiency required (ls, glob, or readdir?)

by jperlq (Acolyte)
on Aug 27, 2008 at 17:54 UTC ( #707234=perlquestion )

jperlq has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a more efficient means of getting data from a directory of tab-separated files into a hash. The directory is full of files in the format: key\tvalue0\tvalue1\tvalue2\nkey\tvalue0\tvalue1\n ... The individual files are small (~10 lines), but the directory can be fairly large (10,000 files). I am sure there is a better way to do it than the way I have been, so I come to the monks to find the best way. Here is what I have been doing:
my $dir = "/Path/to/a/data/directory";
my %hash;
my @ls = `ls $dir`;
foreach (@ls){
    chomp;
    next if /\~$/ || !$_;
    my $file = $_;
    my $info = `cat $dir/$_`;
    my (@lines) = split(/\n/,$info);
    for (@lines){
        s/^\s+//;
        next if /^\#/ || !$_;
        my ($key, @values) = split(/\t/);
        $hash{$file}{$key} = [@values];
    }
}
If there is a better way, such as glob or readdir, please give an example of how I should adapt my code.

Replies are listed 'Best First'.
Re: greater efficiency required (ls, glob, or readdir?)
by ikegami (Pope) on Aug 27, 2008 at 18:01 UTC

    In theory, all solutions will boil down to a readdir (the system call), so it's likely to be the fastest.

    In practice, the speed difference between the different methods is probably minor, so you should use the method you find easiest to use and maintain. If you find that too slow, *then* come to us.

    My personal opinion is that running a child process to get a directory listing is a rather silly thing to do. I wouldn't use ls. Doubly so for using cat for reading a file!!!

      My personal opinion is that running a child process to get a directory listing is a rather silly thing to do. I wouldn't use ls. Doubly so for using cat for reading a file!!!

      I second that; especially because $dir and $_ will be interpreted by the shell. So you will get problems if a directory name or entry has special characters in it.

      Even if you don't think that this is important in your case, it's better to make the code more maintainable and re-usable for security-aware scenarios.

      While you can avoid these problems by using open my $pipe, '-|', 'ls', $dir, it's really not worth the trouble; readdir (or IO::Dir) has fewer problems. And for reading the file, use open or File::Slurp.
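A minimal sketch of the list-form pipe open mentioned above, assuming a Unix-like system with ls on the PATH. The helper name list_dir_via_ls is hypothetical. Because the command and its arguments are passed as a list, no shell is involved, so spaces and metacharacters in $dir should reach ls unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: run ls without going through a shell.
sub list_dir_via_ls {
    my ($dir) = @_;

    # List form of a pipe open: the arguments are handed to exec directly.
    # '--' ends option parsing in case $dir starts with a dash.
    open my $pipe, '-|', 'ls', '--', $dir
        or die "Cannot run ls on '$dir': $!";
    chomp( my @names = <$pipe> );

    # close returns false if ls exited with a non-zero status.
    close $pipe or die "ls failed for '$dir' (wait status $?)";
    return @names;
}
```

Even in this form, a child process is still started for every listing, which is the overhead the reply above argues against.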

        Actually, $dir and $_ will only be interpreted by the shell if they contain "funny" characters. If a string passed as argument to qx or one-arg system contains just alphanums, underscores and whitespace, no shell gets involved, perl will call execv directly.

        But obviously, if you don't know what $dir contains, you shouldn't use `ls $dir`, and if you aren't in control of the content of the directory you shouldn't use `cat $_`.

        Even if you don't think that this is important in your case, it's better to make the code more maintainable and re-usable for security-aware scenarios.

        I cannot agree with that, if only because the two goals often conflict. More maintainable code usually means simpler code, while code that needs to run in a possibly hostile environment tends to be more complex than code that doesn't have to run in such an environment. "More maintainable" and "re-usable for security-aware scenarios" are most of the time conflicting requirements.

      Actually, using ls has some advantages. For instance, the opendir/readdir solution presented by jwkrahn below will try to open the current and parent directory as if they were files. A plain 'ls' will not return any names starting with a dot. The equivalent of
      my @files = `ls $dir`;
      is
      my @files;
      {
          opendir my $dh, $dir or die;
          @files = grep {!/^\./} readdir $dh;
          closedir $dh;
      }
      As for using cat to read a file, it's something I do often. It's simple. It's a short one liner. Doing it in pure Perl requires several lines, or something cryptic.
      my $info = do {local (@ARGV, $/) = $file; <>};   # Cryptic
      # 7 Lines.
      my $info;
      {
          open my $fh, "<", $file or die;
          local $/;
          $info = <$fh>;
          close $fh;
      }

        using ls has some advantages.

        That's why some use glob.

        Doing it in pure Perl requires several lines

        It would take more than 7 lines to do the equivalent of or die $! when using cat. It's so complex you probably don't even bother doing it.

        The OP's code is the perfect example. By using cat,

        • he used three lines instead of two,
        • he removed the error checking he'd do with open,
        • he introduced a lot of overhead in a loop,
        • he introduced a bug that deletes trailing blank lines, and
        • he introduced a bug for files with spaces and other special characters in their names.

        Update: Added OP as an example.
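As a rough illustration of the pure-Perl alternative to cat discussed above, the following sketch wraps the open/read/close sequence in a small function with error checking. The name slurp is hypothetical and is not the File::Slurp API.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into a string, dying on failure.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;                      # undef record separator: read everything at once
    my $content = <$fh>;
    close $fh or die "Cannot close '$path': $!";
    return defined $content ? $content : '';
}
```

Once the helper exists, the call site is a one-liner such as my $info = slurp("$dir/$file"); and the file name never passes through a shell.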

        The equivalent of
        ...
        is
        ...

        No, it is not. ls(1) will call getdents(2) until that doesn't return any more directory entries and finally output the list, while perl's readdir returns control after each call to getdents(2). Depending on the size of the directory to be read and the tasks to be done on each entry, it can make a big difference in the distribution of iowait load (which, in sum, will be the same of course). I'm talking about Linux here.

Re: greater efficiency required (ls, glob, or readdir?)
by broomduster (Priest) on Aug 27, 2008 at 19:28 UTC
    If the directory is large and you want to do globbing of file names, then ls can be a bad idea. From the shell:

    -> ls | wc -l
    61498
    -> ls *
    bash: /bin/ls: Arg list too long
    You wouldn't actually do this to get the full directory listing, of course, but if the list of something*.whatever files that you want is too long, you're still hosed. In that case, you want a readdir / grep combo or glob (which doesn't suffer the same limitation as ls).
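The readdir / grep combination mentioned above could look roughly like this. The helper name matching_files and the example pattern are assumptions for illustration. The file names never appear on a command line, so the argument-list limit does not apply.

```perl
use strict;
use warnings;

# Hypothetical helper: names in $dir that match $pattern, sorted.
sub matching_files {
    my ($dir, $pattern) = @_;
    opendir my $dh, $dir or die "Cannot open '$dir': $!";
    my @names = grep { $_ =~ $pattern } readdir $dh;
    closedir $dh;
    return sort @names;
}
```

A call such as matching_files($dir, qr/^something.*\.whatever\z/) plays the role of the shell pattern something*.whatever.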
Re: greater efficiency required (ls, glob, or readdir?)
by jwkrahn (Monsignor) on Aug 27, 2008 at 18:07 UTC

    For efficiency you want something like this:

    my $dir = '/Path/to/a/data/directory';
    opendir my $DH, $dir or die "Cannot open '$dir' $!";
    my %hash;
    while ( my $file = readdir $DH ) {
        next if $file =~ /~$/;
        open my $FH, '<', "$dir/$file" or die "Cannot open '$dir/$file' $!";
        while ( my $line = <$FH> ) {
            next if /^#/ || !/\S/;
            my ( $key, @values ) = split /\t/;
            $hash{ $file }{ $key } = \@values;
        }
    }

    Update: I realised that the while loop is incorrect, it should be:

    while ( my $line = <$FH> ) {
        next if $line =~ /^#/ || $line !~ /\S/;
        my ( $key, @values ) = split /\t/, $line;
        $hash{ $file }{ $key } = \@values;
    }

      what about skipping the directory entries?

      my $dir = '/Path/to/a/data/directory';
      opendir my $DH, $dir or die "Cannot open '$dir' $!";
      my %hash;
      while ( my $file = readdir $DH ) {
          next if $file =~ /~$/;
          next if -d "$dir/$file";    # should also skip '.' and '..' entries
          # read and process file
      }

      update: fix path issue "$dir/$file"

      Thanks, I got your version to work with very few changes.
      my $dir = '/path/to/data/directory';
      my %hash;
      opendir my $DH, $dir or die "cannot open '$dir' $!";
      while (my $file = readdir $DH ) {
          next if $file =~ /~$/;
          next if -d $file;
          open my $FH, "<", "$dir/$file" or die "Cannot open '$dir/$file' $!";
          while ( my $line = <$FH> ) {
              next if /^#/ || !length($line);
              my ($key, @values ) = split(/\t/, $line);
              $hash{ $file }{ $key } = \@values;
          }
      }
      It even seems to work quite a bit faster than the ls/cat combo.

        You need to change: next if -d $file; to next if -d "$dir/$file";

        You need to change: next if /^#/ || !length($line); to next if $line =~ /^#/ || !length($line);
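Putting the corrections from this sub-thread together, a complete loader might look like the sketch below. The function name load_dir is hypothetical, and the chomp is an addition not present in the posted versions: without it the last value on each line keeps its trailing newline.

```perl
use strict;
use warnings;

# Sketch: load every tab-separated file in $dir into a hash of hashes,
# so that $hash{$file}{$key} = [ values ].
sub load_dir {
    my ($dir) = @_;
    my %hash;

    opendir my $dh, $dir or die "Cannot open '$dir': $!";
    while ( defined( my $file = readdir $dh ) ) {
        next if $file =~ /~\z/;        # editor backup files
        next if -d "$dir/$file";       # directories, including '.' and '..'

        open my $fh, '<', "$dir/$file"
            or die "Cannot open '$dir/$file': $!";
        while ( my $line = <$fh> ) {
            chomp $line;               # addition: drop the trailing newline
            next if $line =~ /^#/ || $line !~ /\S/;
            my ( $key, @values ) = split /\t/, $line;
            $hash{$file}{$key} = \@values;
        }
        close $fh;
    }
    closedir $dh;

    return \%hash;
}
```

The explicit defined() around readdir keeps a file named '0' from ending the loop on older perls, and -d is tested against "$dir/$file" rather than the bare name.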

Re: greater efficiency required (ls, glob, or readdir?)
by kyle (Abbot) on Aug 27, 2008 at 19:20 UTC

    For fun, I did a rewrite. I haven't tested it, but it compiles cleanly.

    As to your question, I think a grep on readdir would be the best way to go to get your list of files. The files themselves, you could process line-by-line instead of reading every line at once, and that might be better.

    You might be able to get the shell to do even more of the work for you, though.

    my $dir = "/Path/to/a/data/directory";
    my %hash;
    open my $grep_fh, '-|', "grep '^' $dir/* /dev/null"
        or die "Can't grep: $!\n";
    while ( my $line = <$grep_fh> ) {
        # grep prefixes each line with "filename:"; strip it and keep the name
        $line =~ s/^([^:]+)://;
        my $file = $1;
        next if $file =~ /\~$/;
        %{$hash{$file}} = (%{$hash{$file}}, %{hashify_line( $line )});
    }
    close $grep_fh
        or die "Error closing grep pipe: $!\n";

    This way, you get grep and the shell to do all the I/O.

    Notes:

    • I hope you don't have any filenames with a colon in them.
    • This uses sub hashify_line as defined in the <readmore> above. (Hey, refactoring pays off!)
    • We assume also that $dir does not contain any shell metacharacters. If yours isn't really a literal as in your example, you may have to sanitize it.
    • You could probably get grep to do some of your line filtering for you, but I'd just as soon do that in Perl.
    • Likewise, you could use find and xargs to choose the file list to pass to grep, and I'd really rather do that in Perl.
    • Both of those "rather do that in Perl" statements may need to be reevaluated in light of performance problems. (For example, if you waste a lot of time ignoring lines.)

    Update: broomduster makes a good point in Re: greater efficiency required (ls, glob, or readdir?) also. If you have too many files, the "$dir/*" to the shell is going to bomb. Time for xargs, then. Something like:

    my $cmd = "find $dir -type f -print | xargs grep '^' /dev/null";
    open my $grep_fh, '-|', $cmd
        or die "Can't grep: $!\n";

    Then you may have to filter out dot files somewhere.

Re: greater efficiency required (ls, glob, or readdir?)
by superfrink (Curate) on Aug 28, 2008 at 04:48 UTC
    ls sorts the list of files by default. If you don't care about the order of the files you can shave off a bit of time. ls -f will turn off sorting.

    The difference you see will depend on your filesystem and you will have to try it out to see how much faster it is. In my experience there is a small difference with ReiserFS but there is a noticeable difference with Solaris UFS.
      ls -f will turn off sorting.

      The man page for ls on my linux system states that ls -f is equivalent to enabling -a (among other things). This would change the behaviour regarding plain ls so I would recommend using ls -U instead.

      --
      David Serrano

Re: greater efficiency required (ls, glob, or readdir?)
by jethro (Monsignor) on Aug 27, 2008 at 18:19 UTC

    || !$_ is possibly unsafe, you would ignore a file named '0'

    Also usually 'or' is better than '||' because of lower precedence (doesn't make a difference here though)

      || !length($_) (Upd: or better yet || !/\S/ ) would be the solution.

      || is fine. It's being used to join expressions, not statements.
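A small demonstration of the '0' problem raised above, using a made-up list of names. It compares a truthiness test with a length test.

```perl
use strict;
use warnings;

# Made-up sample: a normal name, a file called '0', an empty string, a backup file.
my @names = ( 'data', '0', '', 'notes~' );

# Truthiness test: drops '' but also drops the legitimate name '0'.
my @by_truth = grep { $_ } @names;

# Length test: drops only the empty string.
my @by_length = grep { length } @names;

print "truth:  @by_truth\n";     # truth:  data notes~
print "length: @by_length\n";    # length: data 0 notes~
```

For lines read from a file, the !/\S/ test suggested above goes one step further and also skips lines that contain only whitespace.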
