
Re: Perl solution for storage of large number of small files

by jbert (Priest)
on Apr 30, 2007 at 09:15 UTC ( #612720=note )

in reply to Perl solution for storage of large number of small files

My first suggestion would be to go with the filesystem, opening up levels of subdirectory as needed. (My second note would be that if this is an email storage system, there is a lot of prior art on this; check out some of the free IMAP servers.)
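A minimal sketch of the "levels of subdirectory" idea (the root path, key name, and two-level depth are my own assumptions, not anything from the original post): hash the key and use the leading hex pairs as directory levels, so no single directory accumulates millions of entries.

```perl
use strict;
use warnings;
use Digest::MD5    qw(md5_hex);
use File::Path     qw(make_path);
use File::Basename qw(dirname);
use File::Spec;

# Spread files over 256 x 256 subdirectories by hashing the key,
# so no single directory ever holds millions of entries.
sub bucket_path {
    my ($root, $key) = @_;
    my $h = md5_hex($key);
    return File::Spec->catfile($root, substr($h, 0, 2), substr($h, 2, 2), $key);
}

my $path = bucket_path('/tmp/store', 'paket-000123');
make_path(dirname($path));                  # create the two levels on demand
open my $fh, '>', $path or die "open $path: $!";
print {$fh} "payload bytes here\n";
close $fh or die "close $path: $!";
```

Lookup is then just `bucket_path()` again on the same key; no index is needed to find a file.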

Note that not all filesystems are created equal. They each make different tradeoffs and some will have much better performance for this load than others.

A journalling filesystem should perform better for a write-heavy load than a non-journalling fs. Disks are fastest when you are writing (or reading) lots of sequential data. File creation and updates will go to the (hopefully sequential) journal and should help a lot. If you simultaneously have a heavy read load, you'll lose a lot in performance - due to seeking - unless you can satisfy most reads from cache (which is, for example, the case in an MTA email queue under most sensible queueing strategies).

Your measurements showing that writing larger files is quicker are surprising. Can you try an 'strace' on the two cases in question to see whether the application code is doing anything different?

Can you tell any more about the application and expected access patterns? It sounds interesting.


Re^2: Perl solution for storage of large number of small files
by isync (Hermit) on Apr 30, 2007 at 10:40 UTC
    Actually I am thinking about switching over completely to the filesystem-only approach and dropping this data-buckets idea. BTW: what is the maximum number of files on my ext3?

    I've got a client-server architecture of scripts here - no emails, no IMAP server... It's a research project. (But the challenges seem similar - thanks for the hint!)
    A data-generation script gathers measurement data and produces the 40K-120K packets of output, while a second script takes this output and makes sense of it, enriching the meta-data index. Both scripts are controlled by a single handler which keeps the meta-data index and stores the data packets (enabling us to run everything in clusters). And that handler is where the bottleneck is. So I am thinking of taking the storage part out of the handler and letting the data gatherer write to disk directly via NFS.

    NFS was also the explanation for the "larger files is quicker" paradox. My development machine tied a hash over NFS, which produced that result. Actually running the script on the server showed me that the tie is always fast. The insert is fast most of the time (although every few cycles, when DB_File expands the file or so, it slows down..). But the untie takes forever on growing files...
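For reference, the tie/untie pattern in question looks roughly like this (the filename and keys are made up for the example). The point is that the inserts mostly land in DB_File's in-memory page cache, and untie is where the cached pages finally get written out - which is why it crawls over NFS as the file grows:

```perl
use strict;
use warnings;
use Fcntl;       # O_CREAT, O_RDWR
use DB_File;     # $DB_HASH format

my $file = '/tmp/index.db';   # hypothetical index file
tie my %index, 'DB_File', $file, O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "tie $file: $!";

# Inserts mostly hit DB_File's page cache, so they look fast...
$index{"paket-$_"} = "offset=$_" for 1 .. 10_000;

# ...and the buffered pages are written out here. Over NFS this
# write-out is the slow step, and it grows with the file.
untie %index;
```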

    The expected access pattern is mostly plain storage (gatherer), followed by one read of every stored packet (sense-maker). Then, every few days, an update/rewrite of all packets, involving possible resizing (gatherer again).

    The "new toy" idea is now to use a set of disks tied together via NFS (distributed) or LVM (locally), mounted on subdirs, building a tree of storage-space leaves (replacing my few-files approach).
      The maximum number of files on a filesystem is limited by the number of inodes allocated when you create it (see 'mke2fs' and the output of 'df -i'). You can also tweak various settings on ext2/ext3 with tune2fs.

      As you probably already know, written data is buffered in many places between your application and the disk. Firstly, the perlio layer (and/or stdio in the C library) may buffer data - this is controlled by $| or the equivalent methods in the more modern I/O packages.
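For example, to defeat that userland buffering on one specific handle (the method form of setting $|; the filename is arbitrary):

```perl
use strict;
use warnings;
use IO::Handle;   # enables the ->autoflush method on filehandles

open my $fh, '>', '/tmp/flush-demo.log' or die "open: $!";
$fh->autoflush(1);            # equivalent to $| for this handle
print {$fh} "one record\n";   # handed straight to the kernel, not held in perlio buffers
close $fh or die "close: $!";
```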

      Flushing will ensure the data is written to the kernel, but it won't ensure the kernel writes it to disk. You need the 'fsync' system call for this (and/or the 'sync' system call). You can get access to these via the File::Sync module.

      Note that closing a filehandle only *flushes* it (write userland buffers), it does not *sync* it (write kernel buffers).
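Putting the two levels together in Perl might look like this (File::Sync is the CPAN module mentioned above; the filename is arbitrary):

```perl
use strict;
use warnings;
use IO::Handle;
use File::Sync qw(fsync);

open my $fh, '>', '/tmp/sync-demo.dat' or die "open: $!";
print {$fh} "important data\n";

$fh->flush;                      # userland (perlio) buffers -> kernel
fsync($fh) or die "fsync: $!";   # kernel buffers -> disk (modulo the drive's own cache)

close $fh or die "close: $!";    # on its own, close would only have flushed, not synced
```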

      (If you're paranoid and/or writing email software, you may also want to note that syncing only guarantees that the kernel has successfully written the data to the disk. Most/all disks these days have a write buffer - there isn't a guarantee that data in that write buffer makes it onto persistent storage in the event of a power failure. You can get around this in various ways, but I'm drifting just a bit out of scope here...)

      The above is to suggest an explanation for 'untie' taking a long time (flushing lots of buffered data on close), and it's also something anyone doing performance-related work on disk systems should know about. In particular, it may suggest why sqlite seemed slow on your workload. For robustness, sqlite calls 'fsync' (resulting in real disk I/O) at appropriate times (i.e. when it tells you that an insert has completed).

      (Looking at one of your other replies...) If you are writing a lot of data to sqlite, you'll probably want to investigate transactions and/or the 'async' mode. By default, sqlite is slow-but-safe; by default, writing data to a bunch of files is quick-but-unsafe. (But both systems can operate in both modes; you just need the right system calls or config options.)
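A sketch of the transaction approach via DBI/DBD::SQLite (the database and table names are invented for the example): batching inserts in one transaction means sqlite pays one fsync per commit rather than one per row, and PRAGMA synchronous lets you drop the fsync entirely without recompiling anything.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/index-demo.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS pakets (id INTEGER PRIMARY KEY, meta TEXT)');

# Slow-but-safe made faster: one fsync per commit instead of one per insert.
$dbh->begin_work;
my $sth = $dbh->prepare('INSERT INTO pakets (id, meta) VALUES (?, ?)');
$sth->execute($_, "meta-$_") for 1 .. 1000;
$dbh->commit;

# Quick-but-unsafe, like plain buffered file writes: skip the fsync on commit.
# $dbh->do('PRAGMA synchronous = OFF');

$dbh->disconnect;
```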

      If you're going to be doing speed comparisons between storage approaches, you need to be sure of your data-integrity needs, and then put each storage system into the mode that suits those needs before comparing. (You may well be doing all this already - apologies for the lengthy response if so.)

        Actually, thank you for the lengthy reply!

        I already learned about sqlite's async mode, but was too lazy to recompile it and just switched the design to in-memory (sqlite was used only on the index part - I am not such a big fan of binary data in databases yet..)
        Pooling updates/writes (as in your transactions hint) was planned to streamline sqlite, but I pulled the plug on this when I opted for the in-memory approach.

        Thanks for all your help guys! Until I need to handle more than 25,000,000 files, plain fs will do (without re-inventing the wheel..)
