PerlMonks  

Re: (OT) should i limit number of files in a directory

by tilly (Archbishop)
on Sep 11, 2008 at 16:15 UTC ( [id://710671]=note )


in reply to (OT) should i limit number of files in a directory

The reason that PAUSE does that is that many filesystems use some variation on scanning a linked list for the directory entries. Therefore you really want to avoid having a single directory with hundreds of thousands of files in it.
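To illustrate the kind of directory splitting being discussed, here is a minimal Python sketch that fans files out into subdirectories by the leading characters of an MD5 hex digest. The function name `sharded_path`, the `store` root, and the two-level, two-character layout are assumptions made for the example, not anything PAUSE or the original poster is confirmed to use.

```python
import hashlib
from pathlib import Path


def sharded_path(root, data, levels=2, width=2):
    """Build root/ab/cd/<digest> from the MD5 hex digest of data.

    levels and width are illustrative defaults: two levels of two hex
    characters give 256 * 256 = 65536 leaf directories.
    """
    digest = hashlib.md5(data).hexdigest()
    # Take the first `levels` chunks of `width` characters as directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, digest)


if __name__ == "__main__":
    print(sharded_path("store", b"hello world").as_posix())
```

With a million files and this layout, each leaf directory would hold only a handful of entries on average, which keeps a linear directory scan cheap.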

However, you say you are on ext3. That filesystem uses an HTree (a hashed B-tree index) for large directories, provided the dir_index feature is enabled, so it is internally already doing what you'd be trying to do.

That said, merlyn is right. There is a lot of hidden overhead to having a small file in the filesystem. If you just want to record the existence of an MD5 hex digest, that is a perfect application for a database, BerkeleyDB, or DBM::Deep.
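As a rough sketch of the "record existence in a DBM file" idea, the following uses Python's standard `dbm` module as a stand-in for BerkeleyDB or DBM::Deep. The helper names `remember` and `seen` and the `digests` file name are made up for the example, and which on-disk backend `dbm` picks depends on the Python build.

```python
import dbm
import hashlib
import os
import tempfile


def remember(db, data):
    """Record the MD5 hex digest of data as a key; return the digest."""
    key = hashlib.md5(data).hexdigest().encode("ascii")
    db[key] = b""  # empty value: only the key's presence matters
    return key.decode("ascii")


def seen(db, data):
    """Return True if the digest of data has been recorded."""
    key = hashlib.md5(data).hexdigest().encode("ascii")
    return key in db


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with dbm.open(os.path.join(tmp, "digests"), "c") as db:
            remember(db, b"first file contents")
            print(seen(db, b"first file contents"))
            print(seen(db, b"something else"))
```

One DBM file replaces a million tiny marker files, so there is no per-file inode or block overhead.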

Re^2: (OT) should i limit number of files in a directory
by leocharre (Priest) on Sep 11, 2008 at 16:27 UTC

    I do have a database keeping track of sums and using ids.

    I am not using this system merely to check existence. The files actually hold something: data that, as it is, does not belong in a database.

    What merlyn and others said about storing in a database makes sense.
    Let's not forget that the filesystem *is* a form of database system. It's a data storage discipline.
    Some things are more appropriate on a fs than on a db server.

    A million text files ranging in size from 1k to 486k would probably cripple a db system; it's too much of a variation. Maybe I'm wrong about that.

    There's no searching, no comparing, and the size of each element is wildly varied. It feels like a fs thing.

      In your original post you said that you were just using the filename to check existence. If it has data, then a file is more reasonable. However, I would still suggest looking at something like DB_File's interface to Berkeley DB.

      That's designed to store data of exactly this type. Its data limits are 4 GB per entry, and 256 terabytes for the entire dataset.

      If you want to store the data on one system and use it on another, then you might want to move up to a database. Sure, there are things like NFS. But if someone innocently goes looking at a directory like that using standard tools over a networked filesystem, you'll be putting everything through an "interesting" stress test. Plus, even though it works today on ext3, that's no guarantee that in 2 years someone won't migrate the data to another system without understanding that that directory really, really needs to be on a specific filesystem.

      While I agree that there are things that belong on filesystems, this feels to me like something that would be happier not living on a filesystem. But if you put it there, then I'm going to suggest that your disks will be happier if you turn off maintenance of last access time in that directory. That information is almost never used, and it causes every read of a file to trigger a metadata write to disk. If you're under load this can be a significant cause of overhead.
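Turning off access-time maintenance for a whole directory is normally done outside the program, for example with the `noatime` mount option or `chattr +A` on Linux. As a different, per-open technique, a reader can ask the kernel not to touch the access time by opening files with `O_NOATIME`. The Python sketch below assumes Linux; the function name `read_without_atime` is made up for the example, and the code falls back to a plain read where the flag is missing or not permitted.

```python
import os


def read_without_atime(path):
    """Read a file, asking the kernel not to update its access time."""
    # O_NOATIME is Linux-specific; getattr yields 0 (no extra flag) elsewhere.
    flags = os.O_RDONLY | getattr(os, "O_NOATIME", 0)
    try:
        fd = os.open(path, flags)
    except PermissionError:
        # The flag is only allowed for the file's owner or a privileged
        # process, so retry as an ordinary read.
        fd = os.open(path, os.O_RDONLY)
    with os.fdopen(fd, "rb") as fh:
        return fh.read()
```

This only helps for reads that go through your own code; anything else that opens the files still updates the access time unless the mount option or file attribute is set.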

      I still think merlyn is right: a db is the way to go. Blobs are not the most elegant or efficient mechanism, but they are very easy to find based on a key. As long as your blobs stay below about 1MB, MySQL or PostgreSQL should be fine. Trying to find a single file in a directory hierarchy of millions of entries is going to perform significantly worse.
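For a feel of what "blobs found by key" looks like, here is a small Python sketch that uses the standard `sqlite3` module as a stand-in for MySQL or PostgreSQL. The table name `blobs`, its columns, and the helper functions are assumptions made for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one table keyed by digest."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blobs ("
        "digest TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def put(conn, digest, body):
    """Insert or overwrite the blob stored under digest."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO blobs (digest, body) VALUES (?, ?)",
            (digest, body),
        )


def get(conn, digest):
    """Return the blob for digest, or None if it is not stored."""
    row = conn.execute(
        "SELECT body FROM blobs WHERE digest = ?", (digest,)
    ).fetchone()
    return None if row is None else row[0]
```

The primary key gives an indexed lookup, so fetching one entry does not depend on walking a directory tree.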
        What about a Berkeley DB? That seems like it would be a good backend here.
