Re: 4k read buffer is too small

by graff (Chancellor)
on Jun 17, 2008 at 02:22 UTC ( [id://692415] )


in reply to 4k read buffer is too small

I'm curious what sort of trials led you to say that using sysread "feels slow"... How slow does it "feel" compared to the default I/O methods? (How much do your colleagues notice your presence when you use sysread, as opposed to the default methods?)

Since almut has already pointed out how to build perl 5.10 with your own custom input buffer size, that's likely to be the way to go -- a specific build of perl for this specific app...

But I'd still be tempted to try a little more with the sysread approach (esp. since you seem to have made some progress with it already), and as for missing PerlIO's utf8 layer, well, you do still have Encode, which basically does the same thing.
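
For instance, the basic shape of that approach -- reading in big chunks with sysread and handling each chunk yourself -- might look something like this (just a minimal sketch; the 1 MB buffer size and the file name are placeholders):

use strict;
use warnings;
use Fcntl;                      # for O_RDONLY

my $bufsize = 1_048_576;        # 1 MB per sysread call -- tune to taste
sysopen( my $fh, 'bigfile.dat', O_RDONLY ) or die "sysopen: $!";
my $buf;
# sysread returns the byte count, 0 at EOF, undef on error
while ( my $got = sysread( $fh, $buf, $bufsize ) ) {
    # process the chunk in $buf here (split into lines, decode, etc.)
}
close $fh;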

And I think it's worthwhile to consider starbolin's comment about improving the use of local disk in your optimization strategy -- in addition to anything else you do. Whatever solution you pick should probably include making a one-time copy of big data chunks to local disk, if only to keep your process from stalling everyone else on the network.

(If the process happens to be modifying or rewriting file contents, all the more reason, perhaps, to work on local storage until the process is done, then "upload" the finished product to your network drives. NFS writes are more expensive than NFS reads, so the fewer writes you do over NFS, the better.)

update: To follow up on my remark about Encode, this snippet produces no warnings about "wide character data", and outputs the appropriate 3-byte sequence (the two-byte utf8 sequence for á followed by LF):

perl -MEncode -e '$_=encode("utf8","\x{e1}\n"); syswrite(STDOUT,$_)'
Doing the same thing on input involves passing your input string as the 2nd arg to decode("utf8",...) -- the return value is a perl utf8 string.
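
For instance (again just a sketch), piping the output of the snippet above through a decoding counterpart shows the two utf8 bytes coming back as a single perl character:

perl -MEncode -e '$_=encode("utf8","\x{e1}\n"); syswrite(STDOUT,$_)' | perl -MEncode -e 'sysread(STDIN,$b,4096); $_=decode("utf8",$b); printf("%d chars\n", length)'

which prints "2 chars" (the á plus the LF).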

Re^2: 4k read buffer is too small
by voeckler (Sexton) on Jun 17, 2008 at 03:47 UTC

    I agree that the data should go to the node's scratch disk, the processing should happen there, and the results should be uploaded back to NFS. Never mind NFS, RAID5 is not helping writes either. However, sometimes files become so ridiculously large that the local scratch does not suffice, and I am forced to work off NFS - though even then I still try to put the products on scratch and upload them to NFS afterwards.

    I did write a simple FullyBuffered module that basically does sysopen, sysreads into a large buffer, maintains a cursor (to avoid unnecessary string copies), etc. I timed it against the original script using Perl's IO, and to my surprise it performed a little worse: the Perl IO version takes about 3 minutes for 2^20 lines, while my fully buffered approach took about 4 minutes for 2^20 lines (dang, I tossed its log file). Of course, I do suspect that I am doing something stupidly inefficient. A simplified sketch of the idea is shown below.
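
    The core of the idea, in a much-simplified sketch (not the actual module -- the names and the 8 MB default buffer size are illustrative):

    package FullyBuffered;    # illustrative name
    use strict;
    use warnings;
    use Fcntl;

    sub new {
        my ( $class, $file, $bufsize ) = @_;
        sysopen( my $fh, $file, O_RDONLY ) or die "sysopen $file: $!";
        return bless { fh => $fh, buf => '', pos => 0,
                       size => $bufsize || 8_388_608 }, $class;
    }

    # Hand out one line at a time by scanning the buffer with index(),
    # advancing a cursor instead of copying the remainder for each line.
    sub getline {
        my $self = shift;
        my $nl;
        until ( ( $nl = index( $self->{buf}, "\n", $self->{pos} ) ) >= 0 ) {
            substr( $self->{buf}, 0, $self->{pos} ) = '';  # drop consumed bytes
            $self->{pos} = 0;
            my $got = sysread( $self->{fh}, $self->{buf},
                               $self->{size}, length $self->{buf} );
            die "sysread: $!" unless defined $got;
            if ( $got == 0 ) {    # EOF: flush an unterminated last line
                return length $self->{buf}
                    ? substr( $self->{buf}, 0, length $self->{buf}, '' )
                    : undef;
            }
        }
        my $line = substr( $self->{buf}, $self->{pos}, $nl - $self->{pos} + 1 );
        $self->{pos} = $nl + 1;
        return $line;
    }
    1;

    Usage would be along the lines of my $in = FullyBuffered->new($file) followed by while ( defined( my $line = $in->getline ) ) { ... }.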

    I thank you very much for showing me the proper utf8 conversions. Should I continue with my module approach, it will come in handy.
