Of large database tables and high memory usage.

by mseabrook (Beadle)
on Jul 27, 2007 at 18:14 UTC

mseabrook has asked for the wisdom of the Perl Monks concerning the following question:

Magnanimous monks,

I'm faced with two problems that I'm hoping your insight can help me resolve.

The first problem (and I have some solutions that help work around it) is that selecting all records from a database table can result in high, possibly unacceptable, memory usage.

Consider this code:

my $sth = $dbh->prepare("SELECT * FROM foo");
$sth->execute;

Upon execute(), DBI requests a result set from the database (in my case, MySQL). After that point, the resulting data is held in memory, to be accessed later with calls to fetchrow() and the like. In my experience, if the table has a very large number of records, this simple bit of code can result in very high memory usage.
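For reference, the row-by-row fetch pattern looks something like this (a minimal sketch; the connection details and column names are invented). The catch is that with DBD::mysql's defaults, the memory is already spent by the time the loop starts:

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details, for illustration only.
    my $dbh = DBI->connect( 'dbi:mysql:database=test', 'user', 'password',
                            { RaiseError => 1 } );

    my $sth = $dbh->prepare('SELECT id, name FROM foo');
    $sth->execute;    # with DBD::mysql defaults, the whole result set is
                      # already buffered on the client at this point

    while ( my $row = $sth->fetchrow_hashref ) {
        # one row at a time at the Perl level, but the memory is already used
        print "$row->{id}\t$row->{name}\n";
    }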

One solution is to impose a maximum limit on the number of records selected at a time, essentially paginating the results. I created a module a while back that does just this. You feed it a query and a maximum record limit (e.g. 25,000) and it returns something akin to a statement handle that can be used to fetch records. Internally, it chops up the query, appends a LIMIT/OFFSET clause, and only selects from the database as needed.
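That module isn't public, but the idea is roughly this kind of sketch (the package name, default batch size, and next_row method are all invented for illustration):

    package BatchedSelect;    # hypothetical name, not the actual module
    use strict;
    use warnings;

    sub new {
        my ( $class, $dbh, $sql, $batch_size ) = @_;
        return bless {
            dbh    => $dbh,
            sql    => $sql,
            batch  => $batch_size || 25_000,
            offset => 0,
            rows   => [],
            done   => 0,
        }, $class;
    }

    # Hand back one row at a time, pulling another LIMIT/OFFSET slice from
    # the database only when the current slice is exhausted.  Memory use is
    # bounded by the batch size rather than the table size.
    sub next_row {
        my $self = shift;

        if ( !@{ $self->{rows} } && !$self->{done} ) {
            my $sth = $self->{dbh}->prepare(
                "$self->{sql} LIMIT $self->{batch} OFFSET $self->{offset}" );
            $sth->execute;
            $self->{rows} = $sth->fetchall_arrayref( {} );
            $self->{done} = 1 if @{ $self->{rows} } < $self->{batch};
            $self->{offset} += $self->{batch};
        }
        return shift @{ $self->{rows} };
    }

    1;

Usage would then be along the lines of my $it = BatchedSelect->new($dbh, 'SELECT * FROM foo'); while (my $row = $it->next_row) { ... }.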

Another solution is to only select the record ids, and then individually select the records by id. Granted, this would be much slower. I'd be trading memory usage for CPU time, but that seems to be a likely tradeoff in any scenario.
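In code, that second approach is roughly this (table and column names invented); note that the list of ids still has to fit in memory, it's just much smaller than the full rows:

    # Grab only the ids up front, then fetch each record individually.
    my $ids = $dbh->selectcol_arrayref('SELECT id FROM foo');

    my $sth = $dbh->prepare('SELECT * FROM foo WHERE id = ?');
    for my $id (@$ids) {
        $sth->execute($id);
        my $row = $sth->fetchrow_hashref;
        # ... process $row ...
    }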

The real problem for me, and the one that I'm more curious about, is finding a way to abstract this problem away, so that I and the programmers I work with will not need to worry about bringing a server to its knees by executing a query that results in unacceptable memory usage. Ideally, the solution would sit at a reasonably low level, so that programmers could take advantage of it whether they're performing a manual query or running a search through our homebrew ORM.

Speaking of ORMs, are there any out there that handle this implicitly? We would be willing to switch to, say, Class::DBI, DBIx::Class, Rose::DB::Object, or SomethingElse, if it had something in place to help control memory usage.

Part of me thinks this is a non-issue and I should just write code that suits the application, but another part of me feels that if I'm writing a script that needs to do something with all records in a given table, I shouldn't need to alter my strategy based on the number of records in the table. Am I way off-base with this? Am I being too idealistic? Is this problem in my head? :)

Re: Of large database tables and high memory usage.
by kyle (Abbot) on Jul 27, 2007 at 19:20 UTC

      Apparently my Super Search skills aren't up to snuff. :/

      That node is indeed useful to me. Thanks!

      Thanks for posting this pointer. I was tremendously confused by this question myself... I could've sworn that DBI was architected to avoid this problem. You shouldn't get an entire result set unless you do a "fetchall" of some sort, and if you want to process the set row by row you use a "fetchrow". Just doing an "execute" shouldn't be placing any burden on the client...

      This appears to be yet another MySQL gotcha... they do something really lame by default to make their server performance look better.

Re: Of large database tables and high memory usage.
by snopal (Pilgrim) on Jul 27, 2007 at 19:13 UTC

    I'm certain that you will find that most table-wide queries will be unbufferable. The reason is that, for the most part, buffering is a result-data-dependent operation. Another issue is that some standard query statements actually invoke a view, which is itself a remapping of the data in the DB.

    I've used DBI and Rose::DB::Object and found that they are optimized for highly selective queries that limit the results to something that fits comfortably in RAM. The Rose::DB::Object::Manager abstraction will provide iterator-style methods, but since it is built on top of DBI, it will not provide "data window" memory management.
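    For reference, that iterator interface looks roughly like this (My::Product and its name column are hypothetical); as noted, the iteration happens at the Perl level and doesn't by itself change how much the driver buffers:

        use Rose::DB::Object::Manager;

        my $iter = Rose::DB::Object::Manager->get_objects_iterator(
            object_class => 'My::Product',              # hypothetical class
            query        => [ status => 'active' ],
        );

        while ( my $product = $iter->next ) {
            print $product->name, "\n";     # one object at a time in Perl
        }
        die $iter->error if $iter->error;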

    I suggest that your applications will benefit from intelligent design rather than generic tools. I'd be very interested in any discussion that suggests otherwise because I've experienced similar design requirements.

      I suggest that your applications will benefit from intelligent design rather than generic tools.

      I tend to follow that philosophy, but there are other programmers working on this codebase and it's sometimes unreasonable to ask the same of them.

      If I can put together a tool that presents a simple way of doing X, while solving problems Y and Z, we'll wind up with more consistent code, and we'll be able to focus on the higher levels of abstraction while creating programs. Historically, this has come at the expense of performance, though.

      Anyway, it looks like the high memory usage that I'm seeing might be a MySQL-exclusive phenomenon.

      I've used DBI and Rose::DB::Object and found that they are optimized for highly selective queries that limit the results to something that fits comfortably in RAM. The Rose::DB::Object::Manager abstraction will provide iterator-style methods, but since it is built on top of DBI, it will not provide "data window" memory management.

      Some DBI drivers provide options for this sort of thing. For example, DBD::mysql provides a "mysql_use_result" attribute to prevent a total transfer of data to the client on execute(). (See the DBD::mysql documentation for more info.)

      (And incidentally, Rose::DB provides access to this attribute.)
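      In DBI terms, turning it on for a single query looks something like this (just a sketch; the trade-off is that the connection can't be used for anything else, and the server holds resources, until every row has been fetched or finish() is called):

          # Ask DBD::mysql to stream rows instead of buffering the whole
          # result set on the client.
          my $sth = $dbh->prepare( 'SELECT * FROM foo',
                                   { mysql_use_result => 1 } );
          $sth->execute;

          while ( my $row = $sth->fetchrow_arrayref ) {
              # rows arrive from the server as you fetch them
          }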

Re: Of large database tables and high memory usage.
by perlNinja (Initiate) on Jul 27, 2007 at 19:36 UTC
    The only time I've encountered this kind of problem was when I had to merge two databases. My options were...

    1. Read the entire contents of table A (smaller table).
    * Iterate over table A's result set and do individual selects from table B
    * Update row in table A

    or

    2. Read Limited and Offset slices of table B
    * Update row in table A

    Since I was able to lock access to both databases, option 2 turned out to be the fastest; roughly, it looked like the sketch below. If other users had been able to add or update rows while I was doing this, it might have broken the script, since my LIMITs and OFFSETs could have been thrown off.
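    A minimal sketch of that slicing (table and column names invented), which is only safe because nothing else was modifying the tables:

        my $upd = $dbh->prepare('UPDATE table_a SET value = ? WHERE b_id = ?');

        my $batch  = 10_000;
        my $offset = 0;

        while (1) {
            # Pull one LIMIT/OFFSET slice of table B; ORDER BY keeps the
            # slices deterministic.
            my $rows = $dbh->selectall_arrayref(
                "SELECT id, value FROM table_b ORDER BY id LIMIT $batch OFFSET $offset",
                { Slice => {} },
            );
            last unless @$rows;

            $upd->execute( $_->{value}, $_->{id} ) for @$rows;
            $offset += $batch;
        }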
Re: Of large database tables and high memory usage.
by leocharre (Priest) on Jul 28, 2007 at 17:06 UTC

    How many results is 'a lot' ? 600k? 80 million?

    What does your table look like? How many cols and what types?

    I find it really strange that you bring up memory as the issue; with the problem you mention, I have always hit a CPU bottleneck, and memory has been a joke. My 'high' row counts are around 12,400,855, which may be nothing compared to yours. Dunno.

    What I do with high-intensity queries is kind of what you mention: I abstract it away. I only fetch x results at a time, let the API hand them out one by one, and when there are no more cached, I fetch the next batch.

    I want to show you one example.

    The following example keeps a query in an object instance.
    # keeps a list fed in object
    sub get_next_indexpending {
        my $self = shift;

        unless ( defined $self->{pending_queue} and scalar @{ $self->{pending_queue} } ) {

            # this operation can be expensive. SO, i get 50 at a time, and cache it in the object
            # as the API, it seems like you just keep asking for the next one
            # we do not actually query the db for the next one, because that would be EXCRUCIATINGLY SLOW
            # even asking for many more, could be slow
            # i've fiddled around with maybe 3 or 4 ways of doing this operation, this works well
            # it's been debugged a lot, there have been MANY bugs doing this, so DONT FUCK WITH IT :-)
            # multiple indexers *can* have the same list - that's ok, because only one will lock
            # the funny thing is if you select 50 at a time, and you have 51 indexers.. then what????
            # I THINK THERE IS A RACE CONDITION HERE
            # I think there should be a formula for:
            # ( how many indexers are running * NUMBER ) = LIMIT
            my $LIMIT = 50;

            # we could be querying a LOT? It seems that would be wasteful to all hell.
            # maybe the select can be random or ordered in different ways, alternating????

            debug("pending queue is empty.. ");

            # make sure it's defined
            $self->{pending_queue} = [];

            if ( defined $self->{gpd_stopflag} and $self->{gpd_stopflag} ) {
                debug("stopflag was raised, no more in pending. Will prepare and execute..");
                return;    # so we return undef.
            }

            debug("will refeed next $LIMIT");

            # this is a hack replacement since i cant prepare with passing offset
            my $gpd = $self->dbh_sth(    # can not figure out how to pass offset to a prepped query
                'SELECT abs_path, md5sum FROM files WHERE NOT EXISTS'
                . ' (SELECT id FROM md5sum WHERE md5sum.md5sum = files.md5sum LIMIT 1)'
                . " GROUP BY md5sum LIMIT $LIMIT"
            );

            # i realized getting the first 50 or first whatever.. IS ALWAYS VALID
            # Because if it has already been indexed by another indexer.. that operation is committed
            # and subsequent selects to the next pending list.. will no longer return that file as a result
            # SO, dont use an incrementing offset. seems like it made sense.. but NO.
            $gpd->execute;
            debug("ok.\nWill iterate through results..");

            while ( my @row = $gpd->fetchrow_array ) {    # WAS USING for!!!
                # debug("into queue [@row]");
                push @{ $self->{pending_queue} }, \@row;
            }
            debug( sprintf "got [%s]\n", scalar @{ $self->{pending_queue} } );

            # if the count is less than 50, turn on a stop flag so we dont keep requesting pending
            if ( scalar @{ $self->{pending_queue} } < 50 ) {
                $self->{gpd_stopflag} = 1;
                debug( sprintf "got less than 50 (got %s), turning on stop flag\n",
                    scalar @{ $self->{pending_queue} } );
            }

            scalar @{ $self->{pending_queue} } or warn("no more pending files found");
        }

        my $a = shift @{ $self->{pending_queue} };
        defined $a or return;

        my ( $abs_path, $md5sum ) = @$a;
        debug("returning abs path, $md5sum\n");
        $abs_path or die("missing abs path");
        $md5sum   or die("missing md5sum");

        return ( $abs_path, $md5sum );
    }

    =head3 get_next_indexpending()

    No argument. Returns the abs_path and md5sum string for the next file in the queue.

    You should attempt to lock afterwards: because of the nature of indexing, it can take
    a long time, and we may be running multiple indexers, so attempting to lock is needed.

    If none are pending, returns undef.

    Every time you call get_next_indexpending, it returns a different file:

       while ( my ($abs_path, $md5sum) = $self->get_next_indexpending ) {
          # lock or next
       }

    The md5sum string is the md5 hex sum for the file data at the time the files table was
    updated. You should check it again on disk so you know it has not changed in the
    meantime, and also, if you are remote indexing, to make sure the data was not
    corrupted in transit.

    This sub DOES return either those two values OR undef.

    =cut

    This is straight from the source, so there's some seemingly irrelevant code.

    My point is that yes, you are insane. Thinking you will solve this problem across the board for all kinds of situations is.. nuts. If you could do that, you would put all database people out of work. There's a reason why you should possibly not be asking this on PerlMonks, but on the MySQL site instead: what you are trying to do with MySQL is a world unto itself!!! I used to think I would code Perl and then just.. you know.. do a teeny bit of SQL lookup on the side. NO! What database you use, your column types, whether you have an index, whether you normalize.. this makes or breaks your sh1+ entirely.

    I think if you try to 'abstract the problem away' in some API, you'll waste a ton of time and come up with a wonderful set of solutions for less general situations than you would like. But.. shucks, you might solve the whole problem too.

      Given what I'm seeing, the tables don't have to be that large, in terms of record count, to have a large impact on memory usage.

      To give you an example, I have one table with 2,048,812 records. A decent amount, but not other-worldly.

      I don't have a wonderful grasp on how Linux uses or reports memory, so I'll show you what top is showing me when I select from that table.

      This is with a fresh perl instance:

        PID USER      PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
      14069 mseabroo  17   0  9336  5540 1816 S  0.0  2.1  0:05.68  iperl
      And after selecting all records from the table with a simple "prepare", "execute":
        PID USER      PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
      14069 mseabroo  16   0  250m  159m   844 S  0.0 61.1  0:11.22  iperl

      This is immediately after calling execute().

      It should be noted that I'm selecting all columns here, but collectively they don't amount to much: a couple of fixed-width chars, a few integers, a few doubles, and a datetime.

      In any case, it appears MySQL prefers to put the burden of processing on the client, which is why it's tossing the entire result set over to my script. This behavior is the default, but it is optional. Even so, I don't yet know whether it will be feasible to turn it off.
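      If it does turn out to be feasible, flipping the default for a whole connection would look something like this sketch (with the caveat that a streaming statement must be fully fetched, or finish()ed, before the handle runs anything else):

          $dbh->{mysql_use_result} = 1;   # stream results instead of storing them

          my $sth = $dbh->prepare('SELECT * FROM foo');
          $sth->execute;                  # no longer slurps the table client-side

          while ( my @row = $sth->fetchrow_array ) {
              # ...
          }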

Re: Of large database tables and high memory usage.
by oha (Friar) on Jul 30, 2007 at 12:02 UTC
    Executing the same query multiple times using LIMIT/OFFSET can produce different results: the data can change between the queries unless you use a high isolation level (and even then, you can get an error instead of the data).

    As far as I know, though, the data is managed by the backend and not dumped to the driver during the execute() call. Or at least it is supposed to be.

    Oha

      Executing the same query multiple times using LIMIT/OFFSET can produce different results

      That's a really good point. So far I've only been using the strategy in scenarios where that wouldn't be an issue, but I'll have to keep it in mind if I want to make it a more general solution.
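      One way to sidestep that drift, when the table has an ordered unique key such as an auto-increment id, is to page on the key itself instead of using OFFSET. A rough sketch (column names invented):

          # Keyset-style paging: remember the last id seen, so concurrent
          # inserts and deletes can't shift the window between slices.
          my $sth = $dbh->prepare(
              'SELECT id, name FROM foo WHERE id > ? ORDER BY id LIMIT 25000' );

          my $last_id = 0;
          while (1) {
              $sth->execute($last_id);
              my $rows = $sth->fetchall_arrayref( {} );
              last unless @$rows;

              for my $row (@$rows) {
                  # ... process $row ...
              }
              $last_id = $rows->[-1]{id};
          }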
