http://qs321.pair.com?node_id=342802

AidanLee has asked for the wisdom of the Perl Monks concerning the following question:

While more algorithmic rather than Perl specific, I have a programming problem that could use some insight.

I'm storing a set of objects in a database, where the rows are tagged by object id for easy retrieval. The properties themselves may be any kind of textual data from short strings, to numbers to dates to large blocks of text. In order to not be pulling more data than is necessary from the database, not all properties may be pulled for a given object when the object is instantiated in the program. Instead the program will ask for a set of properties it expects to use for the task at hand.

The problem is that later on another part of the program may want other properties that may or may not overlap with the properties that have already been retrieved, and so those properties have to be pulled from the database (without re-pulling the already retrieved properties, of course).

Requesting objects from the database singly, this doesn't pose a problem. But generally objects are pulled from the database in batches to reduce the number of queries, e.g.:

retrieveByID( properties => [ 'foo','bar','baz' ], ids => [ 4,5,62 ] ); #later.... retrieveByID( properties => [ 'a','b','c' ], ids => [ 7,8,89 ] ); #later... retrieveByID( properties => [ 'foo','c','d' ], ids => [ 4,8,102 ] );

We don't want to re retrieve the whole object from the database if it already exists in the system, but we do want to get any missing properties without pulling already loaded properties a second time. And since queries are generally time-expensive, we want to do it in as few queries as possible

It occurred to me that what I've got is something of a matrix-structure:

o1 o2 o3 o4 p1 x x p2 x x p3 x x p4 x p5 x x

Now I know very little about matrix math, and perhaps an answer to my question does not lie there. But given a situation similar to the above, where objects o1-o4 have properties loaded from amongst p1-p5 (at the x's), is there a way to discern a near optimal or optimal number of queries to retrieve all of the missing data while minimising the amount of re-retrieved data? I also need to consider that an algorithm to come up with these queries also needs to account for situations where there are a great many more properties than objects, or vice versa.

The clincher of course is that the time taken to determine an optimal or near-optimal set of queries plus the time to run said queries has to be faster than a full re-retrieval or a one-query-per-id scheme, or it is rendered useless.