Perl Read-Ahead I/O Buffering

by jeffthewookiee (Sexton)
on Oct 26, 2006 at 15:17 UTC [id://580786]

jeffthewookiee has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently writing scripts to read pipe-separated values from a very large text file (160+ GB). It's necessary to process every line, but slurping the whole file with @lines = <BIGGUNFILE> is infeasible due to memory requirements. Currently the scripts use the standard line-by-line idiom: while (my $line = <BIGGUNFILE>). I assume that this is pretty inefficient, since it should cause a lot of very small reads instead of reading the data in large chunks. I've also experimented with Tie::File, which reports that it will buffer data, but it too seems to work line by line, and since I only need to process each line once, buffering this way doesn't help me much. Is there another approach in Perl whereby I can read larger chunks of data at a time, yet not slurp in the whole file? In other words, I'd like to read ahead and buffer a set of X lines so that the IO would be faster...
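
For concreteness, the loop in question is essentially this (the filename and field handling are placeholders for illustration, not the poster's actual code):

    use strict;
    use warnings;

    # Lexical filehandle, three-arg open.
    open my $fh, '<', 'biggun.txt' or die "Can't open biggun.txt: $!";

    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\|/, $line;   # pipe must be escaped in the regex
        # ... process @fields ...
    }

    close $fh;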

Replies are listed 'Best First'.
Re: Perl Read-Ahead I/O Buffering
by dave_the_m (Monsignor) on Oct 26, 2006 at 15:24 UTC
    Perl's IO library already does buffering: it typically reads the data in 4K chunks, then returns characters up to and including the next \n.

    Dave.

      Well, is there a way to set the buffer to a larger size?
        4K (or whatever buffer size perl is using) is likely to be a pretty good size for an input buffer, based on lots of experience and tweaking among perl maintainers. It strikes a nice balance between competing resource demands -- a larger or smaller size might improve some things, but hinder others.

        Your processing is going to be line-oriented anyway, and perl's internal buffering is already optimized (in C) to deliver lines while managing the underlying block-oriented buffering.

        If you try doing the buffering yourself (e.g. using read as suggested in another reply), you'll end up slowing things down, because you have to write your own code to figure out the line boundaries, retain the line fragment at the end of each buffer so that you can prepend it to the next buffer, and so on. It's not only slower to run; it's also slower and harder to code, test and maintain.

        If the runtime speed of the standard while (<>) loop in perl is a serious issue for your task, maybe you just need to use C. But then you'll spend even more time for coding, testing and maintenance. It's a question of whose time is more important and expensive: the programmer's, or the cpu's.
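
        That said, if you still want to experiment with the buffer size: there is a CPAN PerlIO layer, PerlIO::buffersize, that is meant to set a handle's buffer size at open time. Its availability and exact syntax are an assumption here (it is not mentioned anywhere in this thread), so treat this as a sketch:

            use strict;
            use warnings;
            use PerlIO::buffersize;   # CPAN module, not core: an assumption

            # Request a 1MB buffer for this handle instead of the default.
            open my $fh, '<:buffersize(1048576)', 'biggun.txt'
                or die "open: $!";

            while ( my $line = <$fh> ) {
                # ... same line-by-line processing as before ...
            }
            close $fh;

        Benchmark before and after; as noted above, the default is usually hard to beat.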

Re: Perl Read-Ahead I/O Buffering
by Fletch (Bishop) on Oct 26, 2006 at 15:27 UTC

    You assume incorrectly. Perl uses the underlying IO routines (stdio or PerlIO or whatnot) to read chunks of the file in, and returns only up to the next record separator at a time. Until this behind-the-scenes buffer is emptied, no disk IO will occur (presuming the chunk read was large enough to reach the subsequent record separator).

Re: Perl Read-Ahead I/O Buffering
by swampyankee (Parson) on Oct 26, 2006 at 17:24 UTC

    I'm not a Perl guru, but I have been knocking around computing for several decades. My experience has been that I/O buffering, especially read I/O buffering, is managed by the OS, via the drivers. Again, from my experience, there is fairly little you can do at the application level to manage this.

    You could, in theory, set $/ to a reference to a (fairly large) number, which makes the readline operator return fixed-size records instead of lines, and see if it makes a noticeable difference, but I suspect the overhead introduced in explicitly processing end-of-record markers yourself would eat up any savings.
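
    A minimal sketch of that record-mode read (the 64KB record size and the filename are arbitrary placeholders):

        use strict;
        use warnings;

        open my $fh, '<', 'biggun.txt' or die "open: $!";

        local $/ = \65536;   # a ref to a number: readline returns 64KB records
        while ( my $record = <$fh> ) {
            # $record is a fixed-size chunk, NOT a line; it will usually end
            # mid-line, so lines must be reassembled by hand (see the read()
            # reply below for one way to carry the partial line over).
            # ... process $record ...
        }
        close $fh;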

    emc

    At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

    —Igor Sikorsky, reported in AOPA Pilot magazine February 2003.
Re: Perl Read-Ahead I/O Buffering
by samtregar (Abbot) on Oct 26, 2006 at 18:55 UTC
    Doing your own buffering probably won't help you, since Perl already does that. However, all is not lost - you can beat Perl if you're willing to code in C. For example, check out Text::CSV_XS - it beats all the other CSV parsers by doing low-level IO and parsing with a hand-coded state machine. If your data is CSV or similar you might be able to use it directly; if not, you might be able to borrow the technique and adapt it to your format.
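
    For pipe-separated data that might look something like this (a sketch; the sep_char setting and the filename are assumptions about your setup):

        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new({
            sep_char => '|',   # pipe-separated rather than comma-separated
            binary   => 1,     # tolerate embedded control characters
        }) or die Text::CSV_XS->error_diag;

        open my $fh, '<', 'biggun.txt' or die "open: $!";
        while ( my $row = $csv->getline($fh) ) {
            # $row is an array ref of this line's fields
            # ... process @$row ...
        }
        $csv->eof or $csv->error_diag;   # distinguish EOF from a parse error
        close $fh;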

    -sam

Re: Perl Read-Ahead I/O Buffering
by sth (Priest) on Oct 26, 2006 at 20:34 UTC
    Have you tried read()? If the records are fixed length, you could read length * number-of-rows bytes at a time and then split on the newline; otherwise, read in a big chunk and add logic to keep the partial record from the end of the buffer and prepend it on the next read.

        while ( read($fh, $buf, $len) ) {
            @lines = split "\n", $buf;
            # . . .
        }

    It may be faster; worth a try.
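
    A sketch of that partial-record handling (the 1MB chunk size and the filename are placeholders):

        use strict;
        use warnings;

        open my $fh, '<', 'biggun.txt' or die "open: $!";

        my $tail = '';   # line fragment carried over from the previous chunk
        while ( read($fh, my $buf, 1_048_576) ) {
            $buf = $tail . $buf;
            my $last_nl = rindex($buf, "\n");
            if ($last_nl < 0) {      # no newline yet: keep accumulating
                $tail = $buf;
                next;
            }
            $tail = substr($buf, $last_nl + 1);
            for my $line ( split /\n/, substr($buf, 0, $last_nl) ) {
                # ... process $line ...
            }
        }
        # handle a final line that has no trailing newline
        # ... process $tail if length $tail ...
        close $fh;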
Re: Perl Read-Ahead I/O Buffering
by NiJo (Friar) on Oct 26, 2006 at 20:22 UTC
    Firstly, I'd make sure that you are really limited by the read speed; only then does it make sense to optimize. I've no idea where the 160 GB of data is going after munging, but that destination (a database?) might also be your bottleneck. Profiling the application should be your first step. The slurp approach with a manageable file size (approx. RAM size) should be close to the optimum. Even if you can't use it in the finished program, it makes sense to benchmark it against line-by-line reading.

    The line-by-line approach has a single loop that pays all the round-trip times (from disk to destination). Cutting that into multiple processes/threads makes better use of resources.

    buffer < infile | your_app.pl
    effectively answers your initial question: the buffer utility is a separate process that reads ahead and keeps your script supplied with data. http://search.cpan.org/src/TIMB/DBI_AdvancedTalk_2004/index.htm might be a first read for output bottlenecks.
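
    For the benchmarking, the core Benchmark module is enough. A sketch (the filename is a placeholder, and note that the OS cache will favour whichever variant runs later unless you flush it between runs):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $file = 'biggun.txt';

        cmpthese(3, {
            line_by_line => sub {
                open my $fh, '<', $file or die "open: $!";
                my $lines = 0;
                $lines++ while <$fh>;
                close $fh;
            },
            fixed_records => sub {
                open my $fh, '<', $file or die "open: $!";
                local $/ = \1_048_576;   # 1MB records
                my $bytes = 0;
                $bytes += length while <$fh>;
                close $fh;
            },
        });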
Re: Perl Read-Ahead I/O Buffering
by hsinclai (Deacon) on Oct 27, 2006 at 00:32 UTC
    Hi!

    I saved this post by BrowserUK long ago: Re: Muy Large File

    In that particular case it had to do with performing a search-and-replace within a very large file using sysread, and it works quite well. (It's not exactly on the point of your original question, but it might help.)

    -Harold

Re: Perl Read-Ahead I/O Buffering
by jbert (Priest) on Oct 27, 2006 at 08:37 UTC
    If you run your application under 'strace' you'll be able to see the read() syscalls being made to request the data from the OS. As others have mentioned, this will almost certainly be done in a suitable chunk size.

    Lots of games are being played here. The OS (if it is Linux, at least) may detect that you're doing a sequential read of the file and start doing read-ahead to get the data you're likely to read into kernel RAM before you even issue the read() for the next bit.

    Your hard drive has a few (8? 16?) MBytes of RAM on it and again does similar readahead tricks.

    Basically, everyone has optimised everything for the common case of the application developer sequentially reading a file from start to finish, so just go for it :-)

    If you're still interested in tweaking, check out iostat, vmstat and sar to profile your running system and see which resource is being maxed out. If you hit 100% disk utilisation, then you might be disk limited. In that case you can compare the time taken for a run of your app against the time taken by 'dd if=/your/file of=/dev/null bs=4096', which should be pretty much a best case for your box. You can even play with different chunk sizes with that command if you want to see whether that makes a noticeable difference. Or create some soft RAID arrays if you have multiple disks and too much time on your hands.

    Oh, and if you are going to take timings like that, you'll have to reboot (or otherwise flush the OS page cache) between each run, and not read the file in the meantime; otherwise you won't really be reading it from disk, since some of it may have been cached.

Re: Perl Read-Ahead I/O Buffering
by dk (Chaplain) on Oct 30, 2006 at 13:39 UTC
    Did you try any of Mmap / Sys::Mmap / IPC::Mmap?
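
    With Sys::Mmap that might look roughly like this (a sketch: it assumes a 64-bit perl, since a 160+ GB file won't fit in a 32-bit address space, and the filename is a placeholder):

        use strict;
        use warnings;
        use Sys::Mmap;

        open my $fh, '<', 'biggun.txt' or die "open: $!";

        my $map;
        mmap($map, 0, PROT_READ, MAP_SHARED, $fh)   # length 0 = whole file
            or die "mmap: $!";

        # The file now reads like one giant string; walk it line by line
        # without copying the whole thing.
        while ( $map =~ /\G([^\n]*)\n/gc ) {
            my $line = $1;
            # ... process $line ...
        }

        munmap($map) or die "munmap: $!";
        close $fh;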
