PerlMonks

Re (tilly) 1: I'm falling asleep here

by tilly (Archbishop)
on Oct 21, 2001 at 05:02 UTC ( [id://120339] )


in reply to -s takes too long on 15,000 files

Welcome to a real world demonstration of why algorithm efficiency matters. :-)

The likely reason stat is so slow is that each stat has to traverse the entire directory linearly to find the file you are interested in. That means you are going back to the directory structure 15,000 times, each time scanning on average about 7,500 entries to find the file of interest. The resulting 100,000,000-plus entry reads are what is taking the time.

You can verify this by printing every 100th filename as you go. You should see a progressive slowdown, because each successive stat has to scan past more and more of the preceding directory entries.
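
A minimal sketch of that check, assuming the files live in a single directory named `files' (the directory name and the Time::HiRes timing are my additions, not from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);    # sub-second timestamps

    my $dir = 'files';           # hypothetical directory of 15,000 files

    opendir my $dh, $dir or die "Can't open $dir: $!";
    my $start = time;
    my $count = 0;
    while ( defined( my $name = readdir $dh ) ) {
        next if $name eq '.' or $name eq '..';
        my $size = -s "$dir/$name";    # the stat under suspicion
        $count++;
        # If each stat is a linear directory scan, the elapsed
        # time per 100 files should keep growing.
        printf "%6d %-20s %.2fs elapsed\n", $count, $name, time - $start
            if $count % 100 == 0;
    }
    closedir $dh;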

You would speed this up by roughly a factor of 10 by arranging the files into 10 directories of 1,500 files each. You would speed up by a much larger factor with a filesystem designed to handle directories with many small files. (Think ReiserFS on Linux.)

The ideal answer, of course, would be for your single pass through the directory to pull back not just each name but also the associated metadata. That is what ls and dir do, and it is why they are so much faster. Unfortunately Perl's API doesn't give you direct access to that information.

An incidental note: using Perl does not guarantee portable code. For instance, your use of lc will cause porting problems on operating systems with case-sensitive filesystems (like Unix). You will be statting a different file than the one you saw, in fact most likely one that isn't there.
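
To make the lc point concrete, a tiny sketch (the filename is hypothetical):

    use strict;
    use warnings;

    # On a case-sensitive filesystem, 'README' and 'readme' are
    # different names; lowercasing before the stat looks up the
    # wrong file, which probably doesn't exist.
    my $name  = 'README';      # as returned by readdir
    my $lower = lc $name;      # what the original code stats: 'readme'
    my $size  = -s $lower;     # looks up the wrong file
    print defined $size ? "size: $size\n" : "stat failed: $!\n";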

Replies are listed 'Best First'.
Re: Re (tilly) 1: I'm falling asleep here
by Fletch (Bishop) on Oct 21, 2001 at 05:31 UTC

    No, ls does call stat (and/or lstat). Here's a chunk of the output from running truss ls -l . on FreeBSD:

        $ truss ls -l |& grep stat
        ...
        lstat("Makefile",0x809d24c)  = 0 (0x0)
        lstat("cmp.c",0x809d348)     = 0 (0x0)
        lstat("extern.h",0x809d44c)  = 0 (0x0)
        lstat("ls.1",0x809d548)      = 0 (0x0)
        ...

    The output will look very similar on other OSen as well. Depending on the implementation of ls it's either calling opendir and stat itself, or it's using a library (fts(3) on FreeBSD) to traverse things. But underneath they're all doing the same thing and making the same system calls.

    You are correct that many Unix filesystems don't cope well with large numbers of entries, since locating a name within a given directory is a linear scan (more than a few hundred entries may show performance problems; anything over 1,000 more than likely will).

    A solution to this (if you can change how things are structured on disk but can't switch to a fancier filesystem) is to use the initial characters of each entry as a key and keep a second-level hierarchy of subdirectories, one per key. For example, if your files are named with hexadecimal numbers, you might have a top-level directory with subdirectories `00' and `01' through `fe', `ff'. A file named `ace0beef' would then live at `toplevel/ac/ace0beef'. If those directories start getting crowded, just add another layer of subdirectories underneath each of them. As long as you abstract out the mapping, you can change the underlying storage hierarchy without having to alter your programs.
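
    A sketch of that mapping in Perl (the two-character key width and the `toplevel' path are assumptions for illustration):

        use strict;
        use warnings;
        use File::Basename qw(dirname);
        use File::Path qw(mkpath);

        # Map a filename to its bucketed location, e.g. 'ace0beef'
        # -> 'toplevel/ac/ace0beef'. Keeping the mapping in one
        # function lets the on-disk layout change later without
        # touching any callers.
        sub bucketed_path {
            my ($top, $name) = @_;
            my $key = substr $name, 0, 2;   # first two characters as the key
            return "$top/$key/$name";
        }

        my $path = bucketed_path('toplevel', 'ace0beef');
        mkpath( dirname($path) );           # create 'toplevel/ac' on demand
        print "$path\n";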

      The example code was running on Windows, not Unix.

      However your other points are true. I admit that I am guessing as to how the Windows dir function is running so much faster than the simple Perl shown.

      But one note: it may be that your filenames don't divide well on the first few characters (so one directory gets a ton of files while the rest stay nearly empty). In that case the above scheme can be improved by first taking an MD5 hash of the filename and then choosing the directory based on characters of that hash.

      (-:At which point your on-disk data storage is starting to be a frozen form of efficient data structures you might learn about in an algorithms course...:-)
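
      A sketch of the MD5 variant, using the standard Digest::MD5 module (same hypothetical layout as above):

          use strict;
          use warnings;
          use Digest::MD5 qw(md5_hex);

          # Bucket on the first two hex digits of the name's MD5
          # rather than on the name itself; the hash spreads
          # clustered names evenly over the 256 subdirectories.
          sub hashed_path {
              my ($top, $name) = @_;
              my $key = substr md5_hex($name), 0, 2;
              return "$top/$key/$name";
          }

          print hashed_path('toplevel', 'ace0beef'), "\n";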

        If only it were that simple... the files need to all be in one directory for another program that parses them (one that I didn't write).

        Like I said, ls on Linux takes about 10 seconds for all those files, so it still leaves me guessing.

        I'm also thinking MD5 is a bit of overkill for 15,000 rather short filenames; a simple rotating hash would be quicker, and more efficient.

        =)
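
        For what it's worth, one minimal form of such a rotating hash (my own illustration, not from the thread):

            use strict;
            use warnings;

            # Rotate the accumulator left four bits and XOR in each
            # byte: far cheaper than MD5 on short names, at the cost
            # of a weaker spread.
            sub rot_hash {
                my ($name, $buckets) = @_;
                my $h = 0;
                for my $byte ( unpack 'C*', $name ) {
                    $h = ( ( $h << 4 ) | ( $h >> 28 ) ) & 0xFFFFFFFF;
                    $h ^= $byte;
                }
                return $h % $buckets;
            }

            printf "%02x\n", rot_hash('ace0beef', 256);  # bucket 00..ff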

Re: (tilly) I'm falling asleep here
by ishk0 (Acolyte) on Oct 21, 2001 at 05:19 UTC

    I would have realised that lc was out of place when my code failed at university tomorrow morning. =)

    So it looks like parsing dir output until I can write an XS module for it. Thanks very much for the explanation.
