
Re^5: Perl solution for storage of large number of small files

by jbert (Priest)
on Apr 30, 2007 at 14:31 UTC (#612784)

in reply to Re^4: Perl solution for storage of large number of small files
in thread Perl solution for storage of large number of small files

Often-accessed data will stay in memory whether it is accessed via read() or mmap(). mmap() can be a more convenient interface for precisely the opposite reason: data on disk mapped by mmap() *isn't* brought into memory until it is used, and then only the pages that are touched are brought in (at page granularity, typically 4 KB). A successful read(), by contrast, always brings the requested data into memory.
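A minimal sketch of the two access styles, in Python rather than Perl purely for brevity. The helper names and the sample file are hypothetical. Both functions return the same bytes; the difference is that the read() version copies exactly what was asked for into a new buffer, while the mmap() version maps the whole file and leaves it to the OS to fault in only the pages the slice touches.

```python
import mmap
import os
import tempfile


def read_slice_with_read(path, offset, length):
    """seek() + read(): the requested bytes are copied into a new buffer."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_slice_with_mmap(path, offset, length):
    """Map the whole file read-only, then touch only the slice we need.

    Mapping does not load the file; pages are brought in when accessed.
    (mmap refuses to map an empty file, so the file must be non-empty.)
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]


def make_sample_file(size=1 << 20):
    """Hypothetical helper: write `size` bytes of patterned data to a temp file."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 251 for i in range(size)))
    return path
```

Either way, a second call for the same region of a recently used file should normally be served from the OS page cache rather than from disk.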

This means you are perhaps less likely to have unwanted data in memory with mmap(), but that is because it takes more code to do the read() approach well, not because mmap()'d data is more likely to stick in memory.

The kernel might trigger different heuristics for the two different methods of access (such as readahead if you do a number of sequential reads or a big sequential memory access to an mmap'd area), but I'm not even sure of that - they might go through exactly the same code paths.

I'd say that the biggest difference is that the results of a read() are normally copied into a per-process buffer in the application, whereas multiple processes can in principle share the same copy of mmap'd data.
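The copy-versus-share distinction can be illustrated with a small Python sketch. As a simplifying assumption, two mappings inside one process stand in for two separate processes, and the function name is hypothetical. The buffer returned by read() is a private snapshot, while both shared mappings are views onto the same underlying file pages.

```python
import mmap
import os
import tempfile


def demo_shared_mapping():
    """Return (private_copy, seen_by_b) after writing through one mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"old data")
            f.flush()                 # make sure the file has its final size
            f.seek(0)
            private_copy = f.read()   # read(): bytes copied into our own buffer

            # Default mmap access is shared read/write, so both objects
            # are views of the same file pages.
            view_a = mmap.mmap(f.fileno(), 0)
            view_b = mmap.mmap(f.fileno(), 0)
            try:
                view_a[0:3] = b"new"  # write through the first mapping
                seen_by_b = view_b[:] # read through the second mapping
            finally:
                view_a.close()
                view_b.close()
        return private_copy, seen_by_b
    finally:
        os.remove(path)
```

The buffer obtained with read() keeps the old contents, whereas the second mapping observes the change made through the first, since no per-mapping copy of the data exists.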

Re^6: Perl solution for storage of large number of small files
by andye (Curate) on Apr 30, 2007 at 14:51 UTC
    Hi jbert,

    So can you read() a file that's bigger than memory? I thought you couldn't... hence mmap.

    Best wishes, andye

      You *can* do a single read that's bigger than your available RAM, but that's not what I meant.

      If you want to access data in a file that's larger than your available RAM, you'll basically only be working on part of the file at a time, however you go about it. You'll need something to move parts of the file into and out of memory as you go.

      One option is to use mmap. Your memory access patterns will then determine which pages the OS faults into your process and which are discarded by the LRU.

      You can also use read(). You'll get very similar benefits of caching from the OS, but you'll have to do the "getting data into memory" bit yourself more explicitly.
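      A rough Python sketch of both options for a file that may not fit in RAM. The function names, the sample-file helper, and the 1 MiB chunk size are illustrative assumptions, not anything from the thread. With read() the loop moves data into memory explicitly, one chunk at a time; with mmap() the loop just walks the mapping and the OS decides which pages are resident.

```python
import mmap
import os
import tempfile

CHUNK_SIZE = 1 << 20  # 1 MiB, an arbitrary choice for illustration


def count_newlines_read(path, chunk_size=CHUNK_SIZE):
    """Explicit approach: pull the file through a fixed-size buffer."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # empty bytes means end of file
                break
            total += chunk.count(b"\n")
    return total


def count_newlines_mmap(path, window=CHUNK_SIZE):
    """Mapped approach: walk the mapping; the OS pages data in and out.

    Note: mmap cannot map an empty file, so this assumes a non-empty one.
    """
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for start in range(0, len(m), window):
                total += m[start:start + window].count(b"\n")
    return total


def make_sample_file(lines=1000):
    """Hypothetical helper: write a small file of numbered lines."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(lines):
            f.write(b"line %d\n" % i)
    return path
```

      Neither version holds more than one chunk of file data in the application at a time, and in both cases repeated passes over a hot file should largely be served from the OS cache.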

      mmap has its place and is useful, but I've often come across people who say things like "we'll keep an in-memory cache of recently-used files to avoid having to read them from disk each time", or "we'll use a RAM disk for these files". They don't realise that if their guess at what is recently used is accurate, they don't need to do that, since the OS will make sure the data in those files stays in memory. And if the guess is inaccurate, they're wasting memory which could be put to better use caching the genuinely frequently used stuff.

      In one particular case I saw, the file cache was per-process, so replicated across 60 or so procs on the box, wasting a significant amount of memory (which was a precious resource on the box in question).

      So sorry for picking up on this, but I think many people don't understand that read() can be entirely satisfied from RAM, and will be for a commonly-accessed file (assuming noatime on the mount point, so that reads don't trigger access-time updates on disk).

      Your use of mmap seems perfectly sensible to me, but for reasons of coding simplicity, not because "So memory mapping meant that the often-access data stayed cached in ram". That benefit also applies to read().
