http://qs321.pair.com?node_id=692348

voeckler has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am looking for a way to increase the 4k chunk size that strace shows read() using at the system-call level. Yes, I've read Re^3: Perl Read-Ahead I/O Buffering, and I kindly disagree that 4k is enough for everybody. After all, in C you can use setvbuf to set your buffer sizes, and in C++ you can use a (somewhat convoluted) streambuf call to increase the buffer size. Occasionally you have situations where a larger buffer makes sense.

Motivation:

While Stevens's APUE, chapter 3, shows no significant improvement for buffers over 4k, that data is likely 20 years old by now. My current disk space approaches 0.1 PB. The file system uses 64k blocks. Worse, the persistent storage is delivered via NFS (albeit with inodes and data served from different physical machines, and multiple such pairs for various mount points). Occasionally I need to read large files that exceed the local compute node's disk space, so I am forced to read them from NFS.

Now, each read() that shows up in strace will incur one NFS request. If I have an 8GB file that I read in 4k chunks in 200 parallel processes, my computations will issue roughly 400 million NFS requests. As you can imagine, the few other users with whom I share the file system are angry with me for slowing the servers to a crawl (ever waited a minute for a "cd some/where"?), and the admins are decidedly unhappy, too. The admins actually suggested reading the file in 8M chunks, which would issue only ~200,000 NFS requests. Of course, my solution was to copy that file to the compute node's local disk and then compute from there. But sometimes I have files that are larger than the local scratch space, and I am thus forced to read directly from NFS.

Since PerlIO's setvbuf has been disabled, I wonder: how do I set a larger read buffer size in Perl, so that the read()s seen by strace use more than 4k? Even if it does not make my Perl programs run faster, it would put less load on the NFS server, and thus leave fewer annoyed users for the admins to deal with.

I've trawled the web for some time and couldn't really find an applicable solution for increasing Perl's read buffer sizes. I've written a FullyBuffered module that uses sysread within an object, bypassing regular PerlIO, but it feels slow and does not integrate nicely with PerlIO handles; for instance, I occasionally need the utf8 layer. I'd be loath to recompile my Perl to make the default read buffer size larger, though I would be willing to do so, with good instructions, if that is what it takes.

I'd really appreciate some insight into increasing the read buffer size.

Thank you,
Jens.

Re: 4k read buffer is too small
by almut (Canon) on Jun 16, 2008 at 21:52 UTC

    AFAIK, stdio buffering - as configurable via setvbuf - is incompatible with PerlIO's buffering, which is why it's disabled when you configure Perl to use PerlIO. OTOH, you most probably do want PerlIO... so configuring/rebuilding Perl not to use it isn't really an option.

    Anyhow, a little digging around suggests that you can "configure" PerlIO's buffer size in the file perlio.c:

    STDCHAR *
    PerlIOBuf_get_base(pTHX_ PerlIO *f)
    {
        PerlIOBuf * const b = PerlIOSelf(f, PerlIOBuf);
        PERL_UNUSED_CONTEXT;
        if (!b->buf) {
            if (!b->bufsiz)
                b->bufsiz = 4096;                /* <--- here */
            b->buf = Newxz(b->buf, b->bufsiz, STDCHAR);
            if (!b->buf) {
                b->buf = (STDCHAR *) & b->oneword;
                b->bufsiz = sizeof(b->oneword);
            }
            b->end = b->ptr = b->buf;
        }
        return b->buf;
    }

    At least, I changed that 4096 to 8192, recompiled perl (v5.10.0), and now strace reveals that read(2) is being called for blocks of size 8192, when you execute something like

    open my $fh, "<", $^X or die; while (<$fh>) { }

    while before the change, read blocks were of size 4096.

    Other than that, I haven't done any testing yet. So, no guarantees whatsoever (!) that it'll work in every respect... — just something to play with at your own risk.  Good luck!

      Thank you, this sounds like what I was looking for. I was poking at the Perl code today. I will try this tomorrow.

      PS: Do you think the Perl gods will make a buffer-setting function available again in PerlIO? After all, C has setvbuf and C++ has myistream.rdbuf()->pubsetbuf(buf,bufsize) to let the user override defaults, if he so chooses.

        Do you think the Perl gods will make a buffer setting function available again in PerlIO?

        I can't really speak for the Perl gods, but considering that the configurability of the buffer size currently is near the lowest conceivable level [1], I'd think that making it user-settable (à la setvbuf with stdio) isn't prioritized very highly at the moment.

        You might want to bring the issue up on p5p, however... if you feel determined and are well prepared with good arguments :) — I do remember having come across a related discussion (the last time I felt like needing setvbuf myself), but unfortunately, I can't find it at the moment [2]. I recall sensing some reluctance to change in the overall tone of the thread...

        ___

        [1] "configurability levels" that I could think of:

        • (1) hardcoded magic constant in the code
        • (2) macro/constant (system-dependent) automatically determined during configure
        • (3) compile-time configure option
        • (4) user-configurable global runtime option affecting all buffers (switch, env-var, magic Perl var, whatever)
        • (5) user-configurable runtime option per IO handle (like setvbuf)
        • (6) user-configurable runtime option per PerlIO layer
        • (7) like (6), but dynamically reconfigurable on open/unflushed handles

        [2] googling the p5p archives - i.e. 'setvbuf site:www.xray.mpe.mpg.de' - doesn't produce any hits, although there are definitely some mentions of setvbuf (presumably due to a restrictive robots.txt file)

      That code surprises me. I would have at least expected it to be equal to the page size. And that varies with architecture. On Alpha, for instance, it's 8K.
        It surprises me more that there's a magic number like that buried down in the core. It appears that you should be able to configure that in your own custom IO layer and set the size as big as you wish.

        ACK: I would have expected a getpagesize() call, since pages are often natural boundaries. Or at least a reference to the BUFSIZ that many stdios define, which, after several indirections, comes to 8k on my x86_64 Linux.
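
        As an aside, the page size is easy to query from Perl via the POSIX module (a one-liner sketch; the result depends on the platform):

        perl -MPOSIX -le 'print sysconf(_SC_PAGESIZE)'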

Re: 4k read buffer is too small
by graff (Chancellor) on Jun 17, 2008 at 02:22 UTC
    I'm curious what sort of trials led you to say that using sysread "feels slow"... How slow does it "feel" compared to the default i/o methods? (How much do your colleagues notice your presence when you use sysread, as opposed to the default methods?)

    Since almut has already pointed out how to build perl 5.10 with your own custom input buffer size, that's likely to be the way to go -- a specific build of perl for this specific app...

    But I'd still be tempted to try a little more with the sysread approach (esp. since you seem to have made some progress with it already), and as for missing PerlIO's utf8 layer, well, you do still have Encode, which basically does the same thing.

    And I think it's worthwhile to consider starbolin's comment about improving the use of local disk in your optimization strategy -- in addition to anything else you do. Whatever solution you pick should probably include making a one-time copy of big data chunks to local disk, if only to keep your process from stalling everyone else on the network.

    (If the process happens to be modifying or rewriting file contents, all the more reason, perhaps, to work on local storage until the process is done, then "upload" the finished product to your network drives. NFS writes are more expensive than NFS reads, so the less you do NFS writes, the better.)

    update: To follow up on my remark about Encode, this snippet produces no warnings about "wide character data", and outputs the appropriate 3-byte sequence (two-byte utf8 sequence for á followed by LF):

    perl -MEncode -e '$_=encode("utf8","\x{e1}\n"); syswrite(STDOUT,$_)'
    Doing the same thing on input involves passing your input string as the 2nd arg to decode("utf8",...) -- the return value is a perl utf8 string.
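
    For input, a sysread-plus-decode loop could look roughly like this (an untested sketch; the file name and chunk size are just placeholders):

    use Encode qw(decode);
    use Fcntl  qw(O_RDONLY);

    my $file  = 'largefile';            # placeholder
    my $chunk = 8 * 1024 * 1024;        # 8M per sysread

    sysopen(my $fh, $file, O_RDONLY) or die "sysopen $file: $!";
    my $raw = '';
    while (sysread($fh, $raw, $chunk, length $raw)) {
        # Encode::FB_QUIET decodes as much as it can and leaves any incomplete
        # trailing UTF-8 sequence in $raw for the next iteration
        my $text = decode('utf8', $raw, Encode::FB_QUIET);
        # ... process $text (a Perl character string) here ...
    }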

      I agree that the data should go to the node's scratch, the processing happens, and results are uploaded to NFS again. Never mind NFS, RAID5 is also not helping writes. However, sometimes files become so ridiculously large that the local scratch does not suffice, and I am forced to work off NFS - though I still try to put the products on scratch, and upload them to NFS afterwards.

      I did write a simple FullyBuffered module, basically doing sysopen, sysread into a large buffer, maintaining a cursor (to avoid unnecessary string copies), etc. I was timing this against the original script using Perl's IO, and to my surprise it performed a little worse: the Perl IO version takes about 3 minutes for 2^20 lines, while my fully buffered approach needed about 4 minutes for 2^20 lines (dang, I tossed its log file). Of course, I do suspect that I am doing something stupidly inefficient.
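
      In outline, the approach looks roughly like this (a simplified sketch for illustration, not the actual module; the default buffer size is just an example):

      package FullyBuffered;
      use strict;
      use warnings;
      use Fcntl qw(O_RDONLY);

      sub new {
          my ($class, $file, $bufsize) = @_;
          sysopen(my $fh, $file, O_RDONLY) or die "sysopen $file: $!";
          return bless { fh => $fh, buf => '', pos => 0,
                         size => $bufsize || 8 * 1024 * 1024 }, $class;
      }

      # return the next line including "\n", or undef at EOF
      sub getline {
          my $self = shift;
          for (;;) {
              my $nl = index($self->{buf}, "\n", $self->{pos});
              if ($nl >= 0) {
                  my $line = substr($self->{buf}, $self->{pos}, $nl - $self->{pos} + 1);
                  $self->{pos} = $nl + 1;       # advance the cursor, no copying
                  return $line;
              }
              substr($self->{buf}, 0, $self->{pos}) = '';   # drop consumed bytes
              $self->{pos} = 0;
              my $n = sysread($self->{fh}, $self->{buf}, $self->{size}, length $self->{buf});
              if (!$n) {    # EOF or error: hand back any unterminated tail
                  return length $self->{buf}
                      ? substr($self->{buf}, 0, length($self->{buf}), '')
                      : undef;
              }
          }
      }

      1;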

      I thank you very much for showing me the proper utf8 conversions. Should I continue with my module approach, it will come in handy.

Re: 4k read buffer is too small
by quester (Vicar) on Jun 17, 2008 at 07:07 UTC
    When sgifford mentioned tcpdump it reminded me: on a normal Ethernet segment, the MTU (maximum size of a packet) is customarily set to 1500 bytes. You could probably make the internal workings of NFS more efficient by raising the MTU, which will reduce the number of packets.

    Be wary of the following, though:

    (1) Routing packets from a circuit with a large MTU to a circuit with a smaller one can cause occasional odd problems; for example, distant web sites behind firewalls that block ICMP "path MTU exceeded" messages may no longer be able to send you pages larger than 1500 bytes. You may need to keep the NFS traffic between your client and the servers on an isolated network to avoid this kind of problem. The client and servers can still have interfaces to other networks in order to talk to your other equipment, as long as traffic isn't routed between the networks.

    (2) You need jumbo or giant frame support, which is only common on Gigabit and faster Ethernet. Note that there is no vendor-independent standard for exactly how big a jumbo frame can be, but Cisco suggests 9216 bytes.

    It's very difficult to generalize about how much jumbo packets really help, because there is so much variation in how much of the overhead of breaking up data into multiple packets and then reassembling it can be offloaded onto dedicated hardware. But if you have the appropriate switches and NIC cards, it might be worth a quick benchmark.

    As a starting point, HP ran a benchmark of 9000-byte versus 1500-byte MTU on GigE and showed around 43% better throughput, with around 27% less CPU on the receive side and 43% less CPU on the transmit side.

Re: 4k read buffer is too small
by starbolin (Hermit) on Jun 16, 2008 at 20:30 UTC

    voeckler writes:

    ... I kindly disagree that 4k is enough for everybody.
    I don't mean to be difficult, but that's not what graff said. What he said was that 4k is a compromise that doesn't adversely impact the implementation of perl for the majority of users. I'm sure the 4k number is tied to the 'small' sbrk request size, so I think increasing it beyond 4k is going to mean tweaking malloc, and perhaps also increasing the number of buffers used in parsing an input stream into line chunks.

    Now for one of those dumb, it's-not-my-budget questions: Why not buy (or ask for) a bigger local disk? 8GB is small nowadays.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
      Concerning buying new disks: I used the 8GB case as an example with concrete numbers, to show that it already does bad things to the NFS server. In actuality, the local disks on our machines provide 60 GB of scratch each. However, some rule files are 123 GB (and 500 million lines).
Re: 4k read buffer is too small
by sgifford (Prior) on Jun 17, 2008 at 03:40 UTC

    You might want to look at your NFS client to see if it can be of any help. Readahead could help here a great deal without changing Perl; look at the rsize NFS option, and any other options you have in your NFS client. You will need to test by running tcpdump or looking at your NFS stats, since Perl will still be doing 4K reads, but the OS will be doing larger reads behind the scenes.

    If you're only reading the file from beginning to end, another useful trick is to write a small program to read files in whatever blocksize you need (for example with sysread) and write them to standard output; then you can run that program and pipe its output to your actual program, which can read from the pipe in 4KB blocks without affecting how the NFS server is accessed. If you need to seek around this won't work, but sometimes it can be helpful.
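
    A minimal such reader could be sketched like this (untested; the block size and usage are just examples):

    #!/usr/bin/perl
    # stream a file to stdout in big sysread chunks
    use strict;
    use warnings;
    use Fcntl qw(O_RDONLY);

    my $blocksize = 8 * 1024 * 1024;    # e.g. 8M per read
    my $file = shift or die "usage: $0 file\n";

    sysopen(my $in, $file, O_RDONLY) or die "sysopen $file: $!";
    binmode(STDOUT);

    my $buf;
    while (my $n = sysread($in, $buf, $blocksize)) {
        my $off = 0;
        while ($off < $n) {             # cope with partial writes to the pipe
            my $w = syswrite(STDOUT, $buf, $n - $off, $off);
            defined $w or die "syswrite: $!";
            $off += $w;
        }
    }

    You would then run it as, say, bigread largefile | perl yourscript.pl (the name bigread is hypothetical, of course).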

      If you're only reading the file from beginning to end, another useful trick is to write a small program to read files in whatever blocksize you need (for example with sysread) and write them to standard output; then you can run that program and pipe its output to your actual program, which can read from the pipe in 4KB blocks without affecting how the NFS server is accessed. If you need to seek around this won't work, but sometimes it can be helpful.

      Yes, strong agreement to this trick. My office neighbor also suggested this work-around, since we have at least 2 and up to 8 CPUs per node, but most often the actual computation only uses 1 CPU. CPU cycles are cheap!

      As for the NFS client tuning, I will convey the message, but I suspect that the admins have already done quite a bit of tuning. After all, our directory requests are served from a different physical machine than the data blocks. I don't have god privileges on any of the machines myself.

      XXX:/export/samfs-XXX01 /auto/XXX-01 nfs rw,nosuid,noatime,rsize=32768,wsize=32768,timeo=15,retrans=7,tcp,intr,noquota,rsize=32768,wsize=32768,addr=10.125.0.8 0 0

      The readahead sounds intriguing. How would it work, if 200 clients tried to read the same file, though slightly offset in start time? Wouldn't read-ahead aggravate the server load in this case?

        XXX:/export/samfs-XXX01 /auto/XXX-01 nfs rw,nosuid,noatime,rsize=32768,wsize=32768,timeo=15,retrans=7,tcp,intr,noquota,rsize=32768,wsize=32768,addr=10.125.0.8 0 0
        Interesting, that should be reading in 32KB blocks. You would still see 4K blocks with strace, though, which might be throwing off your analysis. Try seeing if the output of nfsstat or tcpdump matches what you'd expect from strace. If you find that it actually is reading in larger blocks, your sysadmins can try increasing rsize further.

        Also, I seem to recall that you need NFSv3 to read blocks larger than 16K, so if you're not getting the full 32K you are asking for, you might want to look at that.

        The readahead sounds intriguing. How would it work, if 200 clients tried to read the same file, though slightly offset in start time? Wouldn't read-ahead aggravate the server load in this case?
        I'm not familiar with the internals of the Linux NFS code, but generally readahead will write into the buffer cache, and then client requests will be read from there. As long as it doesn't run out of memory it should do the right thing in the scenario you describe.
      ... to write a small program to read files in whatever blocksize you need ...

      It just occurred to me: The small program is called dd:

      dd if=largefile ibs=8M | perl ... | dd of=newfile obs=8M
        genius!
Re: 4k read buffer is too small
by starbolin (Hermit) on Jun 17, 2008 at 02:45 UTC

    Dumb question: What command are you using to measure read sizes? I'm asking because I've been playing with iostat, perl, and large files, and I'm seeing reads from the disk at 16KB, which is FreeBSD's buffer size. So I'm thinking the bottleneck may be in the NFS drivers and not in perl?? Someone correct my thinking here.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}

      I wanted to know the number of read(2) calls, so I used

      strace -e read perl ...

      Each of these reads hits the kernel's VFS as it crosses from userland into the kernel. According to the admins, each read will incur an NFS request to the server. Too many simultaneous requests will topple the server. Fewer NFS requests, as produced by larger buffered reads, are friendlier to the server, even if they do not necessarily speed up my program.
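
      For just counting the calls, strace's summary mode also works:

      strace -c -e trace=read perl ...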

        I think your admins are lying to you. The NFS block size is determined when you issue mount to tie the NFS driver into your file system. Just by coincidence, the default block size is also 4k. The NFS block size determines when and how much data is requested from the server, not the application's I/O block size. See your system's mount manual page.

        After doing just a tiny bit of reading and a little bit of testing on my system I'm convinced that modifying perl's block size would be a wasted effort. It would not change the size of the NFS requests to the server.


        s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}