http://qs321.pair.com?node_id=349100

From The Jargon Dictionary

Slurp

v.t. To read a large data file entirely into core before working on it. This may be contrasted with the strategy of reading a small piece at a time, processing it, and then reading the next piece.
I am writing this meditation because of a recent experience with some application code that slurped an entire 3,700,000 rows' worth of database objects. This caused the database server to crash when it was denied additional memory by the operating system.

The database crash resulted in overnight callouts for a week (to me, as I was on call). Once the database had been restarted, the application was able to continue running - in this case because the DB had lost and rebuilt its cache, and so had a smaller memory footprint.

Working with a consultant from the database vendor, we found the offending application code. Here is what it was doing (pseudocode):
    select * from foo into linked list of object pointers
    foreach object pointer
        retrieve the object
        write its contents to a temporary file in .csv format
    fork a bcp command to load the data into Sybase
We realised that the code just wasn't scalable, especially as the number of objects in the source database keeps growing and is already at 3.7 million.

The problem was compounded by the locking scheme - as the source database did not know what was going to happen to the objects, it took out a lock on each one - 3.7 million locks! This is why the DB server was crashing instead of the application segfaulting.

The solution we arrived at was to use a database cursor and retrieve the objects in batches. We also specified no locking when retrieving the objects, which removed the lock problem. Finally, the objects were properly released and garbage collected (this was C++) before the next batch was retrieved.

Relevance to DBI: sip the data rather than slurping it

I am aware of a similar issue with DBI code - in particular, the use of the selectall or fetchall methods where fetchrow would serve better. I have seen instances of it.

The golden rule is to always think about how many rows your query will return. If you are comfortable holding all of those rows in memory, then by all means slurp. For very large tables, you are much better off using fetchrow to retrieve the data a row at a time (I am not aware of a mechanism for retrieving multiple rows as batches - but that would be nice).
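As a rough sketch of the two styles (the DSN, credentials, table and column names here are all made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Hypothetical connection details - substitute your own DSN and credentials.
    my $dbh = DBI->connect( 'dbi:Sybase:server=BIGSERVER', 'user', 'password',
        { RaiseError => 1 } );

    # Slurping: fine for a small result set, but every row is held in
    # memory at once - risky when the table holds millions of rows.
    my $all = $dbh->selectall_arrayref('select id, name from foo');

    # Sipping: fetchrow_arrayref returns one row at a time, so memory
    # use stays flat no matter how large the table grows.
    my $sth = $dbh->prepare('select id, name from foo');
    $sth->execute;
    while ( my $row = $sth->fetchrow_arrayref ) {
        my ( $id, $name ) = @$row;
        # ... process one row, then let it go ...
    }
    $sth->finish;
    $dbh->disconnect;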

Also consider how much work is being done by the database, and how much by the application. Consider doing most of the work in joins and where clauses - this way, the database server gets to do, and to optimise, most of the work.
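For example (again with made-up table and column names), filtering in the application means shipping every row back just to throw most of them away, whereas a where clause lets the server - and its indexes - do that work:

    # Filtering in the application: fetches every row, discards most of them.
    my $sth = $dbh->prepare('select id, status from foo');
    $sth->execute;
    while ( my ( $id, $status ) = $sth->fetchrow_array ) {
        next unless $status eq 'OPEN';
        # ... process the open ones ...
    }

    # Filtering in the database: only the matching rows come back,
    # and the server can use an index on status to find them.
    my $sth2 = $dbh->prepare('select id, status from foo where status = ?');
    $sth2->execute('OPEN');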

Slurping files

A similar issue occurs when reading in flat files - whether to use the diamond <> operator in list context to slurp, or in scalar context as an iterator.

Once again, if the file is small, slurping is OK.
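A minimal illustration of the two contexts (the file name is just an example):

    # Slurping: the whole file ends up in memory as a list of lines.
    open my $fh, '<', 'data.txt' or die "data.txt: $!";
    my @lines = <$fh>;              # list context - slurp
    close $fh;

    # Sipping: one line at a time, constant memory use.
    open my $in, '<', 'data.txt' or die "data.txt: $!";
    while ( my $line = <$in> ) {    # scalar context - iterate
        chomp $line;
        # ... process $line ...
    }
    close $in;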

Slurpy slurpy sleep sleep

Another consideration is when the file is not a real file, but a pipe or an IPC socket. In this case, a slurping application will pause until the entire stream is available (i.e. the sender has sent EOF), and this pause may last forever if the sender never closes the file.
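A sketch of the difference, assuming a long-running command on the other end of the pipe (the command is just an example):

    # List context would block until the child closes the pipe - which
    # 'tail -f' never does, so slurping here would hang forever.
    open my $pipe, '-|', 'tail -f /var/log/messages' or die "tail: $!";
    # my @all = <$pipe>;            # do not do this on an endless stream

    # Scalar context delivers each line as the sender writes it.
    while ( my $line = <$pipe> ) {
        print "got: $line";
    }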

Considerations for Tk programming

A running Tk application spends nearly all of its time in the MainLoop, and needs to. This is how Tk can service all the user mouse events.

Application code runs inside callbacks. It is desirable for every callback to return as quickly as possible. Delays here result in a noticeable degradation to response time and usability.

Needless to say, Tk applications need to read files. For this purpose there is Tk::fileevent, which arranges for a callback to be called whenever a file handle becomes readable or writable. This lets you avoid slurping the file: read a single line inside the callback, which is triggered again whenever there is more text to read.
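A minimal sketch of the idea (the file name and widget layout are invented for the example):

    use strict;
    use warnings;
    use Tk;

    my $mw   = MainWindow->new;
    my $text = $mw->Scrolled('Text')->pack;

    open my $fh, '<', 'big_log_file.txt' or die "big_log_file.txt: $!";

    # The callback reads one line per invocation, so MainLoop keeps
    # servicing mouse events between lines instead of freezing while
    # the whole file is slurped.
    $mw->fileevent( $fh, 'readable' => sub {
        my $line = <$fh>;
        if ( defined $line ) {
            $text->insert( 'end', $line );
        }
        else {
            # EOF: cancel the callback and close the handle.
            $mw->fileevent( $fh, 'readable' => '' );
            close $fh;
        }
    } );

    MainLoop;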

When it comes to executing operating system commands, Tk::IO (wrongly named in my opinion) can be used to manage the forking and the capture of the output.

--
I'm Not Just Another Perl Hacker