It's bad manners to slurp

by rinceWind (Monsignor)
on Apr 29, 2004 at 12:12 UTC

From The Jargon Dictionary

Slurp

v.t. To read a large data file entirely into core before working on it. This may be contrasted with the strategy of reading a small piece at a time, processing it, and then reading the next piece.
I am writing this meditation because of a recent experience with some application code that slurped an entire 3,700,000 rows' worth of database objects. This caused the database server to crash when the operating system refused to give it any more memory.

The database crash resulted in overnight callouts for a week (to me, as I was on call :(). After a restart, the application was able to continue running - in this case because the DB had lost and rebuilt its cache, and so had a smaller memory footprint.

Working with a consultant from the database vendor, we found the offending application code. Here is what it was doing (pseudocode):
    select * from foo into linked list of object pointers
    foreach object pointer
        Retrieve the object
        Write its contents to a temporary file in .csv format
    fork a bcp command to load the data into Sybase
We realised that the code just wasn't scalable, especially as the number of objects in the source database keeps growing and is already at 3.7 million.

The problem was compounded by the locking scheme - as the source database did not know what was going to happen to the objects, it took out a lock on each one - 3.7 million locks! This is why the DB server was crashing instead of the application segfaulting.

The solution we arrived at was to use a database cursor, and retrieve the objects in batches. Also, we knew to specify no locking when retrieving the objects - hence no lock problem. Finally, the objects were properly released and garbage collected (this was C++) before retrieving the next batch.

Relevance to DBI: sip the data rather than slurping

I am aware of a similar issue with DBI code: in particular, the use of selectall or fetchall compared with fetchrow. I have seen instances of exactly the same problem there.

The golden rule is to always think about how many rows you are going to get back from your query. If you are comfortable about holding all these rows in memory, then by all means slurp. For very large tables, you are much better off using fetchrow to retrieve the data a row at a time (I am not aware of a mechanism for retrieving multiple rows as batches - but that would be nice).
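
For illustration, here is a minimal DBI sketch of the two styles. The DSN, credentials and table names are made up; only the shape of the calls matters.

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details.
    my $dbh = DBI->connect( 'dbi:Pg:dbname=big_db', 'user', 'pass',
                            { RaiseError => 1 } );

    # Slurp: fine for a small lookup table - every row is held in memory at once.
    my $all = $dbh->selectall_arrayref( 'SELECT id, name FROM small_lookup' );

    # Sip: one row at a time - memory use on the Perl side stays roughly
    # constant however many rows the query returns.
    my $sth = $dbh->prepare( 'SELECT id, name FROM huge_table' );
    $sth->execute;
    while ( my $row = $sth->fetchrow_arrayref ) {
        my ( $id, $name ) = @$row;
        # process this row, then let it go
    }
    $sth->finish;
    $dbh->disconnect;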

Also consider how much work is being done by the database and how much by the application. Where possible, push the work into joins and where clauses - that way the database server gets to do, and to optimise, most of it.
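
As a hedged example (the table and column names are invented), the join and the filter below run on the server, which can use its indexes and its optimiser; only the matching rows ever reach the application:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=big_db', 'user', 'pass',
                            { RaiseError => 1 } );

    my $sth = $dbh->prepare( q{
        SELECT o.order_id, c.name
        FROM   orders o
        JOIN   customers c ON c.customer_id = o.customer_id
        WHERE  o.status = ?
    } );
    $sth->execute( 'OPEN' );
    while ( my ( $id, $name ) = $sth->fetchrow_array ) {
        print "$id\t$name\n";
    }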

Slurping files

A similar issue occurs when reading in flat files - whether to use the diamond <> operator in list context to slurp, or in scalar context as an iterator.

Once again, if the file is small, slurping is OK.
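
A small sketch of the two modes ('data.txt' is a placeholder file name):

    use strict;
    use warnings;

    # Slurp: the whole file lands in memory - fine when it is known to be small.
    open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
    my @lines = <$fh>;               # list context: everything at once
    close $fh;

    # Sip: one line per iteration - memory use is bounded by the longest line.
    open $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
    while ( my $line = <$fh> ) {     # scalar context: iterator
        chomp $line;
        # process $line here
    }
    close $fh;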

Slurpy slurpy sleep sleep

Another consideration is whether the "file" is a real file at all, or a pipe or an IPC socket. In that case, a slurping application will pause until the entire stream is available (i.e. the sender has sent EOF). This pause may last forever if the sender never closes its end.
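
For example (the command is a placeholder), slurping with my @lines = <$pipe> would block until the writer closes its end, whereas reading in scalar context hands back each line as soon as it arrives:

    use strict;
    use warnings;

    open my $pipe, '-|', 'some_long_running_command'
        or die "Cannot start command: $!";

    while ( my $line = <$pipe> ) {   # returns as soon as a line is available
        print "got: $line";
    }
    close $pipe;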

Considerations for Tk programming

A running Tk application spends nearly all of its time in the MainLoop, and needs to: this is how Tk gets to service all of the user's mouse events.

Application code runs inside callbacks. It is desirable for every callback to return as quickly as possible. Delays here result in a noticeable degradation to response time and usability.

Needless to say, Tk applications need to read files. For this purpose there is Tk::fileevent, which arranges for a callback to be invoked whenever a filehandle becomes readable or writable. This lets you avoid slurping the file: instead, read a line inside the callback, which is retriggered as long as there is more data to read.
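
A minimal sketch of the fileevent approach (the command being watched is invented, and error handling is skimped):

    use strict;
    use warnings;
    use Tk;

    my $mw   = MainWindow->new;
    my $text = $mw->Scrolled( 'Text' )->pack;    # somewhere to show the lines

    open my $fh, '-|', 'some_long_running_command'
        or die "Cannot start command: $!";

    # Have MainLoop call us back whenever $fh has something to read.
    $mw->fileevent( $fh, 'readable' => sub {
        my $line = <$fh>;
        if ( defined $line ) {
            $text->insert( 'end', $line );       # short callback: one line, then return
        }
        else {
            $mw->fileevent( $fh, 'readable' => '' );   # EOF: remove the callback
            close $fh;
        }
    } );

    MainLoop;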

When it comes to executing operating system commands, Tk::IO (wrongly named in my opinion) can be used to manage the forking and the capture of the output.

--
I'm Not Just Another Perl Hacker

Replies are listed 'Best First'.
Re: It's bad manners to slurp
by Abigail-II (Bishop) on Apr 29, 2004 at 12:33 UTC
    Well, what was happening was a combination of bad designs, of which slurping in all the data was just one. Other factors that played an important role: use of a poor database, one that didn't upgrade 3.7 million row locks to a table lock, and the fact that application code was run on the database server. But what I fail to understand is why there is an application involved at all. From the pseudo-code I get the impression that all that's done is fetching all the data from a table, writing that data to a file, then loading it back into a (different?) database. Why not have the one database dump its table using its native tools?

    Abigail

      Actually, the source database is a weird beast called Versant. It is an object persistence store rather than a database, and it is non-relational.

      When it comes to locking strategies, Versant tends to assume that you know what you are doing (something which was not the case in this instance).

      Why not have the one database dump its table using its native tools?
      That's a thought. The client where I am working does tend to have a philosophy of doing everything in application code (C++, yeugh). I looked at
      db2tty -D $DBNAME -i myClass
      and this does dump all objects in myClass to stdout. There are issues with escaping and format conversion, and it would be hard to convince my colleagues to abandon the extractor code they are already using.

      Anyone else out there using Versant?

      --
      I'm Not Just Another Perl Hacker

Re: It's bad manners to slurp
by samtregar (Abbot) on Apr 29, 2004 at 15:07 UTC
    I am aware of a similar issue with DBI code: in particular, the use of selectall or fetchall compared with fetchrow. I have seen instances of exactly the same problem there.

    Depending on your DBD, calling fetchrow() may also slurp the entire result into memory. It just won't give it to you all at once, which might reduce memory usage to some degree in exchange for the added overhead in repeated fetchrow() calls.

    If you're serious about avoiding this problem in DBI then you need to use LIMIT and OFFSET (or their cursorly equivalent) to step through your result set.
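
    A rough sketch of that stepping, with an invented table and a MySQL/PostgreSQL-style LIMIT/OFFSET clause:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:big_db', 'user', 'pass',
                                { RaiseError => 1 } );

        my $batch  = 1_000;
        my $offset = 0;

        while ( 1 ) {
            # Interpolating the (integer) batch values directly sidesteps
            # drivers that are fussy about placeholders inside LIMIT clauses.
            my $rows = $dbh->selectall_arrayref(
                "SELECT id, payload FROM big_table ORDER BY id LIMIT $batch OFFSET $offset"
            );
            last unless @$rows;
            # ... process this batch of up to $batch rows ...
            $offset += $batch;
        }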

    -sam

      Depending on your DBD, calling fetchrow() may also slurp the entire result into memory. It just won't give it to you all at once, which might reduce memory usage to some degree in exchange for the added overhead in repeated fetchrow() calls.
      I would contend that this is a bug, and a pretty serious one at that. While pre-fetching data is (often) a good idea, it should never be done blindly without checking the available space (and pre-fetching X million rows doesn't make sense, anyway!)

      Michael

        Contend all you want; as far as I know it's the truth for at least DBD::mysql and DBD::Pg. I imagine it would be fairly easy to test this theory with a properly giant DB and GTop.

        -sam

Re: It's bad manners to slurp
by zentara (Archbishop) on Apr 29, 2004 at 19:56 UTC
    I would think if it's a "mission-critical" database, shouldn't the sysadmin have some sort of ulimits set on the users?

    I'm not really a human, but I play one on earth. flash japh
Re: It's bad manners to slurp
by ambrus (Abbot) on May 02, 2004 at 16:02 UTC

    Slightly related issue. Is

        while (local $_ = each %hash) { .... }

    better than

        for (keys %hash) { .... }

    ? Does perl optimize the latter, or does it build a possibly large temporary list of the keys? I know that with arrays,

        for (@array) { .... }

    is fast.

      The for version creates a (potentially large) list, whereas the while version does not.

      The cost of allocating the list is offset against repeated calls to each, with the result that for small lists, for iterates faster, but on larger lists the while version wins out. A quick test shows that the break-even point on my system is around 8000 elements. YMMV

      use Benchmark qw[ cmpthese ];

      $h{ $_ } = 0 for 1 .. 100;
      cmpthese( -3, {
          for   => q[ $h{ $_ }++ for keys %h; ],
          while => q[ $h{ $_ }++ while $_ = each %h; ]
      });
              Rate while   for
      while 11937/s    --  -23%
      for   15544/s   30%    --

      $h{ $_ } = 0 for 1 .. 1000;
      cmpthese( -3, {
          for   => q[ $h{ $_ }++ for keys %h; ],
          while => q[ $h{ $_ }++ while $_ = each %h; ]
      });
             Rate while   for
      while 1165/s    --  -26%
      for   1581/s   36%    --

      $h{ $_ } = 0 for 1 .. 10000;
      cmpthese( -3, {
          for   => q[ $h{ $_ }++ for keys %h; ],
          while => q[ $h{ $_ }++ while $_ = each %h; ]
      });
              Rate   for while
      for   80.0/s    --   -8%
      while 86.9/s    9%    --

      $h{ $_ } = 0 for 1 .. 5000;
      cmpthese( -3, {
          for   => q[ $h{ $_ }++ for keys %h; ],
          while => q[ $h{ $_ }++ while $_ = each %h; ]
      });
            Rate while   for
      while 207/s    --  -15%
      for   243/s   17%    --

      $h{ $_ } = 0 for 1 .. 8000;
      cmpthese( -3, {
          for   => q[ $h{ $_ }++ for keys %h; ],
          while => q[ $h{ $_ }++ while $_ = each %h; ]
      });
            Rate  for while
      for   119/s   --   -0%
      while 119/s   0%    --

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
