Efficient processing of large directory

Elliott has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Efficient processing of large directory (n-tier directories and the Grubby Pages Effect)
by grinder (Bishop) on Oct 02, 2003 at 17:16 UTC

By definition, one cannot process a large directory efficiently :)

Find a way to key the file names so that you can move to a multi-level directory structure. The key might be the first character or two of the filename, or the last two:

filename      key   result
dabhds.txt    d     d/dabhds.txt
xyzzy.txt     x     x/xyzzy.txt
43816.txt     16    16/43816.txt
73813.txt     13    13/73813.txt

The main point to remember is that given the filename, you can derive the directory it should be in. And if it gets moved to the wrong directory, you can check for it programmatically.
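A minimal sketch of that derivation, assuming the keying shown in the table (first character for alphabetic names, last two digits for numeric ones). The helper name bucket_path is hypothetical, not something from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: derive the bucketed path from a filename.
# Numeric names are keyed on their last two digits, anything else
# on its first character, matching the table above.
sub bucket_path {
    my ($filename) = @_;
    my ($base) = $filename =~ /^([^.]+)/    # name without the extension
        or die "cannot derive a key from '$filename'\n";
    my $key = $base =~ /^\d+$/
        ? sprintf('%02d', $base % 100)       # last two digits, zero-padded
        : lc substr($base, 0, 1);            # first character, lowercased
    return "$key/$filename";
}

print bucket_path($_), "\n" for qw(dabhds.txt xyzzy.txt 43816.txt 73813.txt);
```

Because the directory is a pure function of the name, the same helper can be used to check whether a file is sitting in the wrong directory.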

I worked on a web site like yours once. When I was called in, there were more than 600 000 files sitting in one directory. This was on a Linux 2.0 kernel on an ext2 filesystem. The directory entry itself was over 3 megabytes. Needless to say performance suffered...

We managed to get it into a three-level directory structure (7/78/78123.txt) but it took hours of time at the kernel level because the directory traversal was so slow.

Bite the bullet and reorganise your files, before it's too late!

Note that if the filenames are numeric (1232.txt etc.), then you want to key on the last digits, not the first digits; otherwise you'll skew the number of files per directory towards the low-numbered ones because of the Grubby Pages Effect (Benford's law).
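To make the skew concrete, here is a small sketch that buckets the names 1.txt to 17000.txt by first digit and by last digit. Sequential numbering is an assumption made for illustration (17,000 is the file count quoted elsewhere in the thread), and bucket_counts is a hypothetical helper:

```perl
use strict;
use warnings;

# Count how many of the numbers 1 .. $max land in each bucket,
# where $key_of maps a number to its bucket key.
sub bucket_counts {
    my ($max, $key_of) = @_;
    my %count;
    $count{ $key_of->($_) }++ for 1 .. $max;
    return \%count;
}

my $by_first = bucket_counts(17_000, sub { substr($_[0], 0, 1) });
my $by_last  = bucket_counts(17_000, sub { substr($_[0], -1) });

# Print both distributions side by side; a missing bucket counts as 0.
printf "digit %d: first=%5d last=%5d\n",
    $_, $by_first->{$_} // 0, $by_last->{$_} // 0
    for 0 .. 9;
```

Under that assumption the "1" bucket holds 8,112 files when keyed on the first digit, while every last-digit bucket holds exactly 1,700.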

Re: Efficient processing of large directory
by MidLifeXis (Monsignor) on Oct 02, 2003 at 17:25 UTC

In addition to the while suggestion above, depending on the filesystem you use, your suggestion to hash the directories can be a good one, especially if the system has to read multiple disk blocks to find the file you are looking for.

The same concept has been applied to mail spools (qmail), and suggested to help speed up access to home directories on hosts with large numbers of "users".

With the number of files you are considering, you probably want to consider something along the lines of NUMFILES < SPLIT ** DEPTH where SPLIT is the number of subdirectories that can fit in one disk block, and DEPTH is how deep your directory structure should go. Once you get to the point where NUMFILES is larger, then you start needing multiple directory reads to find the file you need to open.
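The rule of thumb can be turned into a quick calculation. This is only a sketch: min_depth is a hypothetical helper, and the SPLIT value of 64 is an assumed figure for illustration, not a measurement of any real filesystem:

```perl
use strict;
use warnings;

# Smallest DEPTH for which NUMFILES < SPLIT ** DEPTH holds.
sub min_depth {
    my ($numfiles, $split) = @_;
    die "SPLIT must be at least 2\n" if $split < 2;
    my ($depth, $capacity) = (1, $split);
    while ($numfiles >= $capacity) {
        $depth++;
        $capacity *= $split;
    }
    return $depth;
}

# SPLIT = 64 is an assumption; substitute the value for your filesystem.
print "depth for 17000 files: ", min_depth(17_000, 64), "\n";
```

With those assumed numbers the answer is 3, since 64 ** 2 = 4096 is too small and 64 ** 3 = 262144 is enough.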

Add this to the while suggestion above, and you should be able to access each individual file (such as an open() call) as quickly (Update: adj -> adv) as the OS can handle it.

Of course, this is all IIRC, and it has been a while since I have applied this in my studies.

Now this is all based on older file systems (inode, UFS style, chained directory block style, etc). The newer filesystems (btree, reiser?) may not have this "problem" anymore.

Update: Fixed spelling mistakes

Re: Efficient processing of large directory
by dws (Chancellor) on Oct 02, 2003 at 17:19 UTC

It worked fine up to its planned limit of around 3000 files, but it's been too successful and the client now has 17,000 files in there!

A side note first: Some operating systems are really, really bad with directories that large. If your client is interested in performance, they (or you, on their behalf) may want to do a bit of performance prototyping. Recent work on FreeBSD has greatly improved its large directory performance, for whatever that's worth.

For your problem, I see two options: The first is to use File::Find to locate all *.txt files, and process them one-by-one. The second is to use opendir()/readdir()/closedir() to read the directory directly, filename by filename. Either one will avoid you having to hold on to a large temporary array.
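A sketch of both options, assuming the per-file work is passed in as a callback. The function names each_txt_readdir and each_txt_find are hypothetical; opendir/readdir/closedir and File::Find's find() are the standard calls:

```perl
use strict;
use warnings;
use File::Find;

# Option 2: read one directory entry by entry, never holding the full list.
sub each_txt_readdir {
    my ($dir, $callback) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /\.txt\z/ && -f "$dir/$name";
        $callback->("$dir/$name");
    }
    closedir $dh;
}

# Option 1: File::Find, which also descends into subdirectories.
sub each_txt_find {
    my ($dir, $callback) = @_;
    find({
        no_chdir => 1,    # keep the current directory; $_ is then the full path
        wanted   => sub {
            $callback->($File::Find::name) if /\.txt\z/ && -f $_;
        },
    }, $dir);
}
```

The defined() test in the readdir loop matters: a file literally named "0" would otherwise end the loop early.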

You can find plenty of examples of each by using Super Search to look for "File::Find" or "opendir".

Re: Re: Efficient processing of large directory

by BrowserUk (Patriarch) on Oct 03, 2003 at 00:10 UTC

It's worth noting that if you're trying to find a subset of the files contained in a subdir, rather than processing them all, then using <*.txt> is considerably faster than using either File::Find or opendir/readdir/closedir. At least that is the case under Win32, as the wildcard matching is done by the OS and only those files matching are passed back.

In the examples below, the first comparison shows selecting all 17576 files in a subdirectory. In this case, glob and File::Find come out pretty much even.

In the second comparison, a subset of 676 files is selected from the 17000 using a wildcard. In this case, the glob runs 650% faster as it is only processing the 676, rather than looping over the whole 17000+.

Of course, if any real processing was being done rather than just counting the files, the difference would rapidly disappear.

In this case, the OP's use of the word "efficient" most likely had to do with the memory used by slurping all 17000 names into memory rather than speed, but if not all of those 17000 files are .txt files, the time saved might be worth having.
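This is not the benchmark from the post. It is a minimal sketch of the two selection styles being compared, so they can be timed on your own system. The helper names are hypothetical, and the glob version assumes the directory path contains no whitespace or wildcard characters:

```perl
use strict;
use warnings;

# Select by wildcard: glob hands back only the matching names.
sub count_with_glob {
    my ($dir, $pattern) = @_;
    my $count = 0;
    while (defined(my $file = glob("$dir/$pattern"))) {
        $count++;
    }
    return $count;
}

# Select by regex: every directory entry is read and tested in Perl.
sub count_with_readdir {
    my ($dir, $regex) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my $count = 0;
    while (defined(my $name = readdir $dh)) {
        $count++ if $name =~ $regex;
    }
    closedir $dh;
    return $count;
}
```

For example, count_with_glob($dir, 'a*.txt') and count_with_readdir($dir, qr/^a.*\.txt\z/) should select the same files; wrapping each call with the core Benchmark module would show whether the difference is worth having on a given platform.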

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Efficient processing of large directory
by BrowserUk (Patriarch) on Oct 02, 2003 at 16:43 UTC

Use while instead.

while( my $file = <directory/*> ) {
    # do stuff
}

Examine what is said, not who speaks.

Re: Re: Efficient processing of large directory

by Elliott (Pilgrim) on Oct 02, 2003 at 16:47 UTC

Thanks for the tip - but why? (Most of all I want it to work - but I also want to understand)

Re: Re: Re: Efficient processing of large directory

by dragonchild (Archbishop) on Oct 02, 2003 at 16:54 UTC

foreach my $n (<*.*>)
{
    # do stuff
}

----

my @list = <*.*>;
my $i = 0;
while ($i <= $#list)
{
    my $n = $list[$i];
    # do stuff
}
continue
{
    $i++;
}
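On the "why" question, one way to see the difference between the two loop forms is to look at the context each one imposes. Perl's glob (the <...> operator) returns all matches at once in list context and one match per call in scalar context. The sketch below uses a hypothetical stand-in sub rather than glob itself, purely to show which context each loop supplies:

```perl
use strict;
use warnings;

# Stand-in for glob: reports the context it was called in.
sub context { return wantarray ? 'list' : 'scalar' }

# foreach evaluates its expression in list context ...
my @seen_by_foreach;
foreach my $c (context()) { push @seen_by_foreach, $c }

# ... while the assignment in a while condition is in scalar context.
my @seen_by_while;
while (my $c = context()) { push @seen_by_while, $c; last }

print "foreach: @seen_by_foreach\n";    # foreach: list
print "while:   @seen_by_while\n";      # while:   scalar
```

This only illustrates the calling convention. How much memory the while form actually saves depends on how your perl implements glob internally, so it is worth measuring.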

------
We are the carpenters and bricklayers of the Information Age.

The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Re: Re: Re: Efficient processing of large directory

by BUU (Prior) on Oct 02, 2003 at 17:51 UTC

Re: Re: Re: Re: Re: Efficient processing of large directory

by merlyn (Sage) on Oct 02, 2003 at 18:54 UTC

Re5: Efficient processing of large directory

by dragonchild (Archbishop) on Oct 02, 2003 at 17:53 UTC

Re: Re: Efficient processing of large directory

by Elliott (Pilgrim) on Oct 03, 2003 at 15:37 UTC

while

Looks like I'd better try subdirectories too.

Re^3: Efficient processing of large directory

by Aristotle (Chancellor) on Oct 03, 2003 at 22:19 UTC

readdir

Makeshifts last the longest.

Re: Re^3: Efficient processing of large directory

by Elliott (Pilgrim) on Oct 05, 2003 at 15:57 UTC

Re: Re^3: Efficient processing of large directory

by BrowserUk (Patriarch) on Oct 04, 2003 at 00:11 UTC

Re^5: Efficient processing of large directory

by Aristotle (Chancellor) on Oct 04, 2003 at 07:35 UTC

Re: Efficient processing of large directory
by jdtoronto (Prior) on Oct 02, 2003 at 17:28 UTC

The problem with timeouts can also be Apache. I am not sure of the exact mechanism, but when I process large directories I send something to the server regularly to 'keep it awake' and keep it outputting, and thus stop it timing the CGI process out. That way I have single CGI scripts that sometimes run for 12 or 13 hours and work nicely.
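A minimal sketch of the 'keep it awake' idea, assuming that an unbuffered progress line every so many files is enough to stop the server from timing out. The function name, the message text and the interval are all hypothetical:

```perl
use strict;
use warnings;
use IO::Handle;

# Run $work on each item, writing a short progress line every $every
# items so that output keeps flowing. $out defaults to STDOUT.
sub process_with_heartbeat {
    my ($items, $work, $every, $out) = @_;
    $out ||= \*STDOUT;
    $out->autoflush(1);    # unbuffered, so each line is sent immediately
    my $done = 0;
    for my $item (@$items) {
        $work->($item);
        $done++;
        print {$out} "processed $done\n" if $done % $every == 0;
    }
    return $done;
}
```

In a CGI script the HTTP header would need to be printed before the first heartbeat line. Whether this defeats a particular timeout depends on the server configuration.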

The other alternative is to have the CGI script launch a separate script that does all the work. That way the timeout is no longer an issue, and the script can do its work quietly in the background.

Hope that helps...

jdtoronto

Re: Efficient processing of large directory
by liz (Monsignor) on Oct 02, 2003 at 18:50 UTC

... Or is there a much better way to do this?...

If you have the possibility of re-formatting a partition, have a look at ReiserFS. It will do the "hashing" of filenames for you and be much more efficient in handling diskspace.

Liz

Re: Efficient processing of large directory
by Elliott (Pilgrim) on Oct 03, 2003 at 13:36 UTC

Reorganise into subdirectories as I originally thought
Use while instead of foreach

while

BTW, the file names are email addresses (opt-in list, no spam here I promise!!) with \W characters removed. I was planning to pick 2nd and 4th chars to name the subdirectories in order to avoid grubbiness. Any thoughts on that?
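A sketch of that plan, for anyone who wants to try the distribution on real data. The helper name email_bucket is hypothetical, and lowercasing the address is an added assumption that the post does not mention:

```perl
use strict;
use warnings;

# Hypothetical helper: strip \W characters from an address, then use the
# 2nd and 4th characters of the result as the subdirectory name.
sub email_bucket {
    my ($address) = @_;
    (my $name = lc $address) =~ s/\W//g;    # lc is an assumption, see above
    die "name too short to key: '$name'\n" if length($name) < 4;
    my $key = substr($name, 1, 1) . substr($name, 3, 1);
    return "$key/$name";
}

print email_bucket('fred@example.com'), "\n";    # rd/fredexamplecom
```

Counting how many names fall into each two-character bucket for the existing files would show whether the 2nd and 4th characters spread them evenly enough.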

Re: Re: Efficient processing of large directory

by tilly (Archbishop) on Oct 05, 2003 at 04:04 UTC

DB_File

But do keep text backups in case an upgrade breaks DB_File.
