Perl Read-Ahead I/O Buffering

by jeffthewookiee (Sexton)
on Oct 26, 2006 at 15:17 UTC [id://580786]

jeffthewookiee has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently writing scripts to read pipe-separated values from a very large text file (160+ GB). It's necessary to process every line, but slurping the whole file with @lines = <BIGGUNFILE> is infeasible due to memory requirements. Currently the scripts use the standard line-by-line idiom: while (my $line = <BIGGUNFILE>). I assume that this is pretty inefficient, since it should cause a lot of very small reads instead of reading the data in large chunks. I've also experimented with Tie::File, which reports that it will buffer data, but it too seems to work line by line, and since I only need to process each line once, buffering this way doesn't help me much. Is there another approach in Perl whereby I can read larger chunks of data at a time, yet not slurp in the whole file? In other words, I'd like to read ahead and buffer a set of X lines so that the IO would be faster...
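
For concreteness, the loop in question is essentially this (the filename and field handling are placeholders for illustration, not the poster's actual code):

    use strict;
    use warnings;

    # Lexical filehandle, three-arg open.
    open my $fh, '<', 'biggun.txt' or die "Can't open biggun.txt: $!";

    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\|/, $line;   # pipe must be escaped in the regex
        # ... process @fields ...
    }

    close $fh;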

Replies are listed 'Best First'.
Re: Perl Read-Ahead I/O Buffering
by dave_the_m (Monsignor) on Oct 26, 2006 at 15:24 UTC
    Perl's IO library already does buffering: it typically reads the data in 4K chunks, then returns characters up to and including the next \n.

    Dave.

      Well, is there a way to set the buffer to a larger size?
        4K (or whatever buffer size perl is using) is likely to be a pretty good size for an input buffer, based on lots of experience and tweaking among perl maintainers. It strikes a nice balance between competing resource demands -- a larger or smaller size might improve some things, but hinder others.

        Your processing is going to be line-oriented anyway, and perl's internal buffering is already optimized (in C) to deliver lines while managing the underlying block-oriented buffering.

        If you try doing the buffering yourself (e.g. using read as suggested in another reply), you'll end up slowing things down, because you have to write your own code to figure out the line boundaries, retain the line fragment at the end of each buffer so that you can prepend it to the next buffer, and so on. It's not only slower to run; it's also slower and harder to code, test and maintain.

        If the runtime speed of the standard while (<>) loop in perl is a serious issue for your task, maybe you just need to use C. But then you'll spend even more time for coding, testing and maintenance. It's a question of whose time is more important and expensive: the programmer's, or the cpu's.
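
        That said, if you still want to experiment with the buffer size: there is a CPAN PerlIO layer, PerlIO::buffersize, that is meant to set a handle's buffer size at open time. Its availability and exact syntax are an assumption here (it is not mentioned anywhere in this thread), so treat this as a sketch:

            use strict;
            use warnings;
            use PerlIO::buffersize;   # CPAN module, not core: an assumption

            # Request a 1MB buffer for this handle instead of the default.
            open my $fh, '<:buffersize(1048576)', 'biggun.txt'
                or die "open: $!";

            while ( my $line = <$fh> ) {
                # ... same line-by-line processing as before ...
            }
            close $fh;

        Benchmark before and after; as noted above, the default is usually hard to beat.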

Re: Perl Read-Ahead I/O Buffering
by Fletch (Bishop) on Oct 26, 2006 at 15:27 UTC

    You assume incorrectly. Perl uses the underlying IO routines (stdio or PerlIO or whatnot) to read chunks of the file in, and returns only up to the next record separator at a time. Until this behind-the-scenes buffer is emptied, no disk IO will occur (presuming the chunk read was large enough to reach the subsequent record separator).

Re: Perl Read-Ahead I/O Buffering
by swampyankee (Parson) on Oct 26, 2006 at 17:24 UTC

    I'm not a Perl guru, but I have been knocking around computing for several decades. My experience has been that I/O buffering, especially read I/O buffering, is managed by the OS, via the drivers. Again, from my experience, there is fairly little you can do at the application level to manage this.

    You could, in theory, set $/ to a reference to a (fairly large) number, which makes the readline operator return fixed-size records instead of lines, and see if it makes a noticeable difference, but I suspect the overhead introduced in explicitly processing end-of-record markers yourself would eat up any savings.
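
    A minimal sketch of that record-mode read (the 64KB record size and the filename are arbitrary placeholders):

        use strict;
        use warnings;

        open my $fh, '<', 'biggun.txt' or die "open: $!";

        local $/ = \65536;   # a ref to a number: readline returns 64KB records
        while ( my $record = <$fh> ) {
            # $record is a fixed-size chunk, NOT a line; it will usually end
            # mid-line, so lines must be reassembled by hand (see the read()
            # reply below for one way to carry the partial line over).
            # ... process $record ...
        }
        close $fh;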

    emc

    At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

    —Igor Sikorsky, reported in AOPA Pilot magazine February 2003.
Re: Perl Read-Ahead I/O Buffering
by samtregar (Abbot) on Oct 26, 2006 at 18:55 UTC
    Doing your own buffering probably won't help you, since Perl already does that. However, all is not lost - you can beat Perl if you're willing to code in C. For example, check out Text::CSV_XS - it beats all the other CSV parsers by doing low-level IO and parsing with a hand-coded state machine. If your data is CSV or similar you might be able to use it directly; if not, you might be able to borrow the technique and adapt it to your format.
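
    For pipe-separated data that might look something like this (a sketch; the sep_char setting and the filename are assumptions about your setup):

        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new({
            sep_char => '|',   # pipe-separated rather than comma-separated
            binary   => 1,     # tolerate embedded control characters
        }) or die Text::CSV_XS->error_diag;

        open my $fh, '<', 'biggun.txt' or die "open: $!";
        while ( my $row = $csv->getline($fh) ) {
            # $row is an array ref of this line's fields
            # ... process @$row ...
        }
        $csv->eof or $csv->error_diag;   # distinguish EOF from a parse error
        close $fh;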

    -sam

Re: Perl Read-Ahead I/O Buffering
by sth (Priest) on Oct 26, 2006 at 20:34 UTC
    Have you tried read()? If the records are fixed length, you could read length * number-of-rows bytes at a time and then split on the newline; otherwise, read in a big chunk and add logic to keep the partial record from the end of the buffer and prepend it on the next read.

        while ( read($fh, $buf, $len) ) {
            @lines = split "\n", $buf;
            # . . .
        }

    It may be faster; worth a try.
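
    A sketch of that partial-record handling (the 1MB chunk size and the filename are placeholders):

        use strict;
        use warnings;

        open my $fh, '<', 'biggun.txt' or die "open: $!";

        my $tail = '';   # line fragment carried over from the previous chunk
        while ( read($fh, my $buf, 1_048_576) ) {
            $buf = $tail . $buf;
            my $last_nl = rindex($buf, "\n");
            if ($last_nl < 0) {      # no newline yet: keep accumulating
                $tail = $buf;
                next;
            }
            $tail = substr($buf, $last_nl + 1);
            for my $line ( split /\n/, substr($buf, 0, $last_nl) ) {
                # ... process $line ...
            }
        }
        # handle a final line that has no trailing newline
        # ... process $tail if length $tail ...
        close $fh;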
Re: Perl Read-Ahead I/O Buffering
by NiJo (Friar) on Oct 26, 2006 at 20:22 UTC
    Firstly, I'd make sure that you are really limited by the read speed; only then does it make sense to optimize. I've no idea where the 160 GB of data is going after munging, but that destination (a database?) might also be your bottleneck. Profiling the application should be your first step. The slurp approach with a manageable file size (approx. RAM size) should be close to the optimum. Even if you can't use it in the finished program, it makes sense to benchmark it against line-by-line reading.

    The line-by-line approach has a single loop that pays all the round-trip times (from disk to destination). Cutting that into multiple processes/threads makes better use of resources.

    buffer < infile | your_app.pl
    effectively answers your initial question: the buffer utility is a separate process that reads ahead and keeps your script supplied with data. http://search.cpan.org/src/TIMB/DBI_AdvancedTalk_2004/index.htm might be a first read for output bottlenecks.
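
    For the benchmarking, the core Benchmark module is enough. A sketch (the filename is a placeholder, and note that the OS cache will favour whichever variant runs later unless you flush it between runs):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $file = 'biggun.txt';

        cmpthese(3, {
            line_by_line => sub {
                open my $fh, '<', $file or die "open: $!";
                my $lines = 0;
                $lines++ while <$fh>;
                close $fh;
            },
            fixed_records => sub {
                open my $fh, '<', $file or die "open: $!";
                local $/ = \1_048_576;   # 1MB records
                my $bytes = 0;
                $bytes += length while <$fh>;
                close $fh;
            },
        });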
Re: Perl Read-Ahead I/O Buffering
by hsinclai (Deacon) on Oct 27, 2006 at 00:32 UTC
    Hi!

    I saved this post by BrowserUK long ago: Re: Muy Large File

    In that particular case it had to do with performing a search-and-replace within a very large file using sysread, and it works quite well. (It's not exactly on the point of your original question, but it might help.)

    -Harold

Re: Perl Read-Ahead I/O Buffering
by jbert (Priest) on Oct 27, 2006 at 08:37 UTC
    If you run your application under 'strace' you'll be able to see the read() syscalls being made to request the data from the OS. As others have mentioned, this will almost certainly be done in a suitable chunk size.

    Lots of games are being played here. The OS (if it is Linux, at least) may detect that you're doing a sequential read of the file and start doing read-ahead to get the data you're likely to read into kernel RAM before you even issue the read() for the next bit.

    Your hard drive has a few (8? 16?) MBytes of RAM on it and again does similar readahead tricks.

    Basically, everyone has optimised everything for the common case of the application developer sequentially reading a file from start to finish, so just go for it :-)

    If you're still interested in tweaking, check out iostat, vmstat and sar to profile your running system and see which resource is being maxed out. If you hit 100% disk utilisation, then you might be disk limited. In that case you can compare the time taken for a run of your app against the time taken by 'dd if=/your/file of=/dev/null bs=4096', which should be pretty much a best case for your box. You can even play with different chunk sizes with that command if you want to see whether that makes a noticeable difference. Or create some soft RAID arrays if you have multiple disks and too much time on your hands.

    Oh, and if you are going to take timings like that, you'll have to reboot (or otherwise flush the OS page cache) between each run, and not read the file in the meantime; otherwise you won't really be reading it from disk, since some of it may have been cached.

Re: Perl Read-Ahead I/O Buffering
by dk (Chaplain) on Oct 30, 2006 at 13:39 UTC
    Did you try any of Mmap / Sys::Mmap / IPC::Mmap?
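
    With Sys::Mmap that might look roughly like this (a sketch: it assumes a 64-bit perl, since a 160+ GB file won't fit in a 32-bit address space, and the filename is a placeholder):

        use strict;
        use warnings;
        use Sys::Mmap;

        open my $fh, '<', 'biggun.txt' or die "open: $!";

        my $map;
        mmap($map, 0, PROT_READ, MAP_SHARED, $fh)   # length 0 = whole file
            or die "mmap: $!";

        # The file now reads like one giant string; walk it line by line
        # without copying the whole thing.
        while ( $map =~ /\G([^\n]*)\n/gc ) {
            my $line = $1;
            # ... process $line ...
        }

        munmap($map) or die "munmap: $!";
        close $fh;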
