
Re^4: Perl solution for storage of large number of small files

by andye (Curate)
on Apr 30, 2007 at 12:06 UTC ( #612754=note )

in reply to Re^3: Perl solution for storage of large number of small files
in thread Perl solution for storage of large number of small files

Hi isync and salva, interesting topic.

Anyway, if you need to access 2GB of data randomly, there is probably nothing you can do to stop disk thrashing other than adding more RAM to your machine, so that all the disk sectors used for the database remain cached.

In this situation - more data than memory, but not loads more - I've found memory mapping works well. In my situation the data accesses were randomly scattered but with a non-uniform distribution, if that makes sense: i.e. although the access wasn't sequential, some data was accessed more often than the rest. So memory mapping meant that the often-accessed data stayed cached in RAM.
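That access pattern can be sketched with Python's stdlib mmap (purely illustrative - the thread itself is Perl-flavoured, and the file size, record size, and "hot" record below are invented for the demo). The whole file is mapped, pages are faulted in lazily, and the frequently touched pages stay in the OS page cache:

```python
import mmap
import os
import tempfile

# A small stand-in for the 2 GB case: 64 KiB of random data on disk.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(64 * 1024))
tmp.close()

RECORD = 512  # hypothetical fixed record size

with open(tmp.name, "rb") as f:
    # Map the whole file read-only. Nothing is read yet: pages are
    # faulted in on first access, and hot pages stay in the page cache.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        def get_record(i):
            off = i * RECORD
            return mm[off:off + RECORD]

        # Non-uniform access: record 3 is "hot", the others are touched once.
        hot = get_record(3)
        for i in (7, 3, 42, 3, 19, 3):
            assert len(get_record(i)) == RECORD
        assert get_record(3) == hot

os.unlink(tmp.name)
```

The point of the sketch is that the program never decides what to keep resident; the kernel's page replacement does that based on the observed access pattern.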

Any decent database should be able to do pretty much the same thing - as long as you configure it with a big query cache - although disk access will be slower than for memory mapping.

The real problem comes if you're making a lot of changes to the data, which busts your cache...

Best wishes, andye


Replies are listed 'Best First'.
Re^5: Perl solution for storage of large number of small files
by jbert (Priest) on Apr 30, 2007 at 14:31 UTC
    Often-accessed data will stay in memory whether it is accessed via read() or mmap(). mmap() can be a more convenient interface precisely because of the opposite effect: data on disk mapped by mmap() *isn't* automatically brought into memory until it is used, and then only the bits which are needed are brought in (subject to 4k page granularity), whereas a successful read() will always bring the data into memory.

    This means you are perhaps less likely to have unwanted data in memory, but that's more to do with it taking more code to do the read() approach well than because mmap()'d data is more likely to stick in memory.

    The kernel might trigger different heuristics for the two different methods of access (such as readahead if you do a number of sequential reads or a big sequential memory access to an mmap'd area), but I'm not even sure of that - they might go through exactly the same code paths.

    I'd say that the biggest difference is the results of a read() are normally copied into a per-process buffer in the application, whereas multiple processes can in principle share the same copy of mmap'd data.
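jbert's contrast can be sketched in Python (an illustration, not from the thread; the file contents and offsets are made up). Both calls yield the same bytes, but read() copies them into a per-process buffer, while the mmap slice touches the mapped pages directly:

```python
import mmap
import os
import tempfile

# Two distinct 8 KiB regions so we can tell which one we fetched.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"A" * 8192 + b"B" * 8192)
tmp.close()

# read(): the kernel copies the requested bytes into our buffer.
with open(tmp.name, "rb") as f:
    f.seek(8192)
    via_read = f.read(16)

# mmap(): no copy up front; slicing faults in only the pages touched.
with open(tmp.name, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        via_mmap = mm[8192:8192 + 16]

assert via_read == via_mmap == b"B" * 16
os.unlink(tmp.name)
```

Either way the data comes out of the OS page cache on a repeat access; the difference is the extra per-process copy that read() makes, and that multiple processes can share the same physical pages of an mmap'd file.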

      Hi jbert,

      So can you read() a file that's bigger than memory? I thought you couldn't... hence mmap.

      Best wishes, andye

        You *can* do a single read that's bigger than your available RAM, but that's not what I meant.

        If you want to access data in a file that's larger than your available RAM, you'll basically only be working on part of the file at a time, however you go about it. You'll need something to move parts of the file into and out of memory as you go.

        One option is to use mmap. Your memory access patterns will then determine which pages the OS faults into your process and which are discarded by the LRU.

        You can also use read(). You'll get very similar benefits of caching from the OS, but you'll have to do the "getting data into memory" bit yourself more explicitly.
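The explicit "getting data into memory" bit with read() might look like this (a hypothetical windowed reader; the window size and file are invented for the demo). Only one window lives in the process buffer at a time, while the OS page cache still keeps recently read blocks in RAM behind the scenes:

```python
import os
import tempfile

# 1 MiB stand-in for a file larger than RAM.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(1 << 20))
tmp.close()

CHUNK = 64 * 1024  # process the file one window at a time

def read_window(path, index, size=CHUNK):
    """Explicitly pull one window of the file into a process buffer."""
    with open(path, "rb") as f:
        f.seek(index * size)
        return f.read(size)

# The application, not the kernel's fault handler, decides what to load.
w0 = read_window(tmp.name, 0)
w7 = read_window(tmp.name, 7)
assert len(w0) == CHUNK and len(w7) == CHUNK

os.unlink(tmp.name)
```

Repeat reads of the same windows will typically be served from the page cache without touching the disk, which is exactly the point being made about read() below.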

        mmap has its place and is useful, but I've often come across people who do things like "we'll keep an in-memory cache of recently-used files to avoid having to read them from disk each time", or "we'll use a RAM disk for these files", not realising that if their guess of recently-used is accurate then they don't need to do that, since the OS will make sure the data in those files stays in memory (and if it's inaccurate then they're wasting memory which could be put to better use caching the genuinely frequently used stuff).

        In one particular case I saw, the file cache was per-process, so replicated across 60 or so procs on the box, wasting a significant amount of memory (which was a precious resource on the box in question).

        So sorry for picking up on this, but I just think that many people don't seem to understand that read() can be entirely satisfied from RAM, and will be for a commonly-accessed file (assuming noatime on the mount point on the box).

        Your use of mmap seems perfectly sensible to me, but for reasons of coding simplicity, not because "memory mapping meant that the often-accessed data stayed cached in RAM". That benefit also applies to read().
