http://qs321.pair.com?node_id=342802

AidanLee has asked for the wisdom of the Perl Monks concerning the following question:

While more algorithmic than Perl-specific, I have a programming problem that could use some insight.

I'm storing a set of objects in a database, where the rows are tagged by object id for easy retrieval. The properties themselves may be any kind of textual data: short strings, numbers, dates, or large blocks of text. To avoid pulling more data than necessary from the database, not all properties are necessarily loaded when an object is instantiated in the program. Instead, the program asks for the set of properties it expects to use for the task at hand.

The problem is that later on another part of the program may want other properties that may or may not overlap with the properties that have already been retrieved, and so those properties have to be pulled from the database (without re-pulling the already retrieved properties, of course).

Requesting objects from the database singly, this doesn't pose a problem. But generally objects are pulled from the database in batches to reduce the number of queries, e.g.:

retrieveByID( properties => [ 'foo','bar','baz' ], ids => [ 4,5,62 ] );
# later....
retrieveByID( properties => [ 'a','b','c' ], ids => [ 7,8,89 ] );
# later...
retrieveByID( properties => [ 'foo','c','d' ], ids => [ 4,8,102 ] );

We don't want to re-retrieve the whole object from the database if it already exists in the system, but we do want to get any missing properties without pulling already-loaded properties a second time. And since queries are generally time-expensive, we want to do it in as few queries as possible.

It occurred to me that what I've got is something of a matrix-structure:

      o1  o2  o3  o4
  p1   x       x
  p2       x       x
  p3   x   x
  p4           x
  p5       x       x

Now I know very little about matrix math, and perhaps the answer to my question does not lie there. But given a situation like the above, where objects o1-o4 have properties loaded from amongst p1-p5 (at the x's), is there a way to discern an optimal or near-optimal set of queries to retrieve all of the missing data while minimising the amount of re-retrieved data? The algorithm that comes up with these queries also needs to account for situations where there are a great many more properties than objects, or vice versa.

The clincher, of course, is that the time taken to determine an optimal or near-optimal set of queries, plus the time to run those queries, has to be less than the time for a full re-retrieval or a one-query-per-id scheme, or the approach is rendered useless.
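
To make that concrete, here is a rough sketch of the naive bookkeeping I have in mind (the %loaded hash and the retrieveByID() stub below are invented for illustration; the real retrieveByID hits the database): group objects by the exact set of properties they still lack and run one query per group.

use strict;
use warnings;

my %loaded;    # $loaded{$id}{$prop} is true once that cell is in memory

sub retrieveByID {    # stand-in for the real DB call, just prints the query shape
    my %args = @_;
    printf "SELECT %s WHERE id IN (%s)\n",
        join( ', ', @{ $args{properties} } ),
        join( ', ', @{ $args{ids} } );
}

sub fetch_missing {
    my ( $props, $ids ) = @_;

    # Bucket ids by which of the requested properties they still lack.
    my %group;    # "sorted prop list" => [ ids missing exactly that list ]
    for my $id (@$ids) {
        my @missing = grep { !$loaded{$id}{$_} } @$props;
        push @{ $group{ join( ',', sort @missing ) } }, $id if @missing;
    }

    # One query per distinct missing-property set.
    for my $key ( keys %group ) {
        my @need = split /,/, $key;
        retrieveByID( properties => \@need, ids => $group{$key} );
        for my $id ( @{ $group{$key} } ) {
            $loaded{$id}{$_} = 1 for @need;    # remember what is now in memory
        }
    }
}

fetch_missing( [qw(foo bar baz)], [ 4, 5, 62 ] );    # one query
fetch_missing( [qw(foo c d)],     [ 4, 8, 102 ] );   # two: 4 lacks only c,d; 8 and 102 lack foo,c,d

In the worst case this still issues one query per distinct missing-property set, which is exactly what I'm hoping a smarter algorithm could improve on by merging groups at the cost of a little re-retrieved data.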

Re: algorithm help for determining efficient db data retrieval
by tachyon (Chancellor) on Apr 06, 2004 at 02:14 UTC

    IMHO you are digging yourself a hole. An object where you want one chunk of properties in one circumstance and another set of properties in another smells like two or more objects to me. You sound like you are trying to optimise something that does not even work yet. Chances are the bottlenecks won't be where you suspect.

    You don't seem to understand what takes time and what does not. When you retrieve data from a DB it takes almost exactly the same time to pull one byte as it does to pull lots of bytes. The reason is that a disk will deliver something like 50MB per second and the connect/parse/bind/execute/seek/disconnect overhead is the same (almost) no matter how much data you pull. The actual data delivery time is ~ 10% or less of the total transaction time.
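
    If you want to convince yourself, it is easy enough to measure on your own schema with DBI and Benchmark. A rough sketch, where the DSN, credentials and the 'objects' table are made up and should be replaced with your own:

    use strict;
    use warnings;
    use DBI;
    use Benchmark qw(cmpthese);

    # Compare fetching one narrow column against fetching every column for
    # the same rows; the per-query overhead tends to dominate both.
    my $dbh = DBI->connect( 'dbi:mysql:test', 'user', 'password',
        { RaiseError => 1 } );

    cmpthese( 500, {
        one_column  => sub {
            $dbh->selectall_arrayref('SELECT id FROM objects WHERE id < 100');
        },
        all_columns => sub {
            $dbh->selectall_arrayref('SELECT * FROM objects WHERE id < 100');
        },
    });

    $dbh->disconnect;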

    Even leaving aside the simple fact that it is far easier/faster for a DB to pull one chunk of data (say a serialised object) than to pull a whole lot of bits from that 'object', your approach will be slower. The best case is that you shave a few % off a single transaction (and it is arguable whether it won't actually take *longer*, due to increased query complexity), but a second transaction at any stage adds 95%+ on top, which totally wipes out any gain.

    I would suggest you rethink your approach. KISS and just pull all the data you need in one pass. If when you have it running you find a bottleneck then look at optimising it.

    Storable and FreezeThaw are two good serialisation solutions. Typically you don't want to store serialised objects in a DB, as you lose most of the value of using an RDBMS, but if you have good reasons to do it these modules can help.
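
    For what it's worth, the Storable side of it is only a few lines. This is illustration only: the hash, the frozen_objects table and the commented-out INSERT are invented for the example.

    use strict;
    use warnings;
    use Storable qw(nfreeze thaw);

    # Freeze a structure into one scalar that could live in a single BLOB
    # column, and thaw it back later. Anything inside the blob is invisible
    # to SQL WHERE clauses, which is the trade-off mentioned above.
    my $obj = { id => 42, foo => 'some text', bar => [ 1, 2, 3 ] };

    my $frozen = nfreeze($obj);    # network-order bytes, portable across machines
    # $dbh->do('INSERT INTO frozen_objects (id, data) VALUES (?, ?)',
    #     undef, $obj->{id}, $frozen);

    my $copy = thaw($frozen);      # back to a live Perl structure
    print $copy->{foo}, "\n";      # prints "some text"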

    cheers

    tachyon

      Well, it's good to know that transferring a little vs. a lot isn't a huge issue. That I did not know. I'm less concerned with my ability to pull some or all of a single object's properties than I am with the possibility of having to pull a small set of properties for, say, 400 objects, but then wanting a larger set of properties for 10 of them. While each object may be fairly lightweight (say a few hundred to a thousand Kbytes), I'm worried about what happens when I've got to deal with objects en masse.

      I am aware of Storable and FreezeThaw, but at least for the time being a relational database is best for me because it allows for robust searching out of the box, and is a platform for modeling relationships between objects.

      That said, though, I fully appreciate the warning about digging that hole.

        If by efficiency you mean speed, the simple fact is that you spend memory to get more speed. Accessing a DB (on disk) is slow compared to accessing data in memory. You can spend memory in two ways: give your DB huge amounts of memory to cache disk pages and indices (so it does not have to hit the disks as often), or simply pull all the data you need into memory in one pass. With RAM priced at a few hundred bucks a GB, and programmer time to code a more memory-efficient solution at a similar price per day, there is a very good business case for throwing more memory at a problem to save code complexity, and thus programmer time, debugging time, and future maintenance issues.

        Perl loves lots of memory. So do RDBMSs. If you are doing much work with either, a small investment in more RAM has long-term, wide-ranging benefits.

        cheers

        tachyon

Re: algorithm help for determining efficient db data retrieval
by tilly (Archbishop) on Apr 06, 2004 at 02:14 UTC
    Without extensive analysis and knowledge of your particular usage parameters, no solution is best. There are solutions that are good, but situations can be found where one outperforms another, and vice versa.

    The usual heuristic for this kind of problem is to fetch back well-defined and simple sets of properties every time you fetch an object (typically "all"), and then cache the results (e.g. with a strategy similar to Memoize) so that you can decide on the fly which objects you don't need to fetch again.
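
    The cache itself can be nothing fancier than a hash keyed on object id. A minimal sketch, with fetch_rows() as a made-up stand-in for whatever actually runs the SELECT:

    use strict;
    use warnings;

    my %cache;    # $id => the fully populated row/object

    sub get_objects {
        my @ids     = @_;
        my @missing = grep { !exists $cache{$_} } @ids;

        if (@missing) {
            # One simple query for everything not already held, all columns.
            $cache{ $_->{id} } = $_ for fetch_rows(@missing);
        }
        return map { $cache{$_} } @ids;
    }

    sub fetch_rows {    # made-up stand-in for the real "SELECT * ... WHERE id IN (...)"
        return map { { id => $_, foo => "foo$_", bar => "bar$_" } } @_;
    }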

    If you're fetching lots of things for differing reasons then trying to be really clever about being minimal in what you fetch generally loses to having big simple categories. Plus a KISS strategy takes less work.

    A widely used module that attempts to address this problem is Class::DBI. See the section on Lazy Population.
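
    From memory, the column-group interface looks roughly like this (the Music::CD class is borrowed from the Class::DBI docs; treat the details as approximate and check the docs):

    package Music::CD;
    use base 'Class::DBI';

    __PACKAGE__->connection('dbi:mysql:music', 'username', 'password');
    __PACKAGE__->table('cd');

    # Column groups drive lazy population: 'Essential' columns come back
    # with the initial retrieve(); columns in other groups are fetched on
    # first access, one extra query per group.
    __PACKAGE__->columns( Primary   => 'cdid' );
    __PACKAGE__->columns( Essential => qw(artist title year) );
    __PACKAGE__->columns( Others    => qw(notes liner_text) );

    package main;
    my $cd = Music::CD->retrieve(1);    # pulls Primary + Essential only
    print $cd->title,      "\n";        # already in memory
    print $cd->liner_text, "\n";        # triggers one extra query for 'Others'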

      The Lazy Population section of the Class::DBI docs was enlightening. Ideally I'd like to be able to hang onto the top x most frequently used objects in memory, but at the moment I live in a CGI world. That will become a possibility when I migrate the platform to mod_perl, but for now I'm stuck with limited-to-no caching between requests.

        Not true. Your RDBMS will cache the most recently accessed data for you in memory without you lifting a finger. If you feed it more memory, it will cache more. Here is a sample my.cnf that we use on servers with 2GB of RAM running squid and apache as well. Compared to the default settings, throughput is roughly tripled for the way we use it.

        [root@devel3 root]# cat /etc/my.cnf
        [client]
        socket=/tmp/mysql.sock

        [mysqld]
        datadir=/var/lib/mysql
        socket=/tmp/mysql.sock
        #set-variable=wait_timeout=3600
        set-variable=key_buffer=640M
        set-variable=max_allowed_packet=32M
        set-variable=table_cache=512
        set-variable=sort_buffer_size=32M
        set-variable=record_buffer=32M
        set-variable=read_buffer_size=32M
        set-variable=myisam_sort_buffer_size=64M
        set-variable=thread_cache=8
        set-variable=query_cache_size=32M
        set-variable=tmp_table_size=32M
        [snip]

        cheers

        tachyon

Re: algorithm help for determining efficient db data retrieval
by dragonchild (Archbishop) on Apr 06, 2004 at 03:00 UTC
    I'm confused. This seriously sounds like you're having a bad case of premature optimization. Write the code in the easiest way you have to code it. If that means pulling all the properties for a given object from the database, then do it! If that means pulling entire objects from the database one at a time, then do it! Later, if stuff is slow, then optimize it. But, you don't even know that it will be slow, let alone where!

    As a second note ... your database design sucks. I haven't seen it, I have no idea what it's for, but I know it sucks rocks. How do I know this? Because you have objects and you want to store them in the database. No No NO! This is a "Bad Plan"(tm). Objects are for using. Databases are for storage with referential integrity. They are completely unrelated things. Sure, you may have an INVOICE table and an Invoice object, and they may look like they're 1-1, but this is the exception, not the rule.

    Here's the solution you are looking for (and you've probably done most of these steps, but I'm going to state them anyway, just in case):

    1. Figure out what you're trying to solve.
    2. Figure out what data you need to accomplish #1.
    3. Figure out what algorithms you need to accomplish #1.
    4. Figure out how to store the data from #2.
    5. Figure out how to code up the algorithms from #3. (This may or may not be OO.)
    6. Then, AND ONLY THEN, do you figure out how #'s 4 and 5 map together.

    Let me repeat - your data and your objects are unrelated entities. You fill the latter with stuff from the former. You should not define your data in terms of your objects. That's just asking for trouble.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: algorithm help for determining efficient db data retrieval
by aquarium (Curate) on Apr 06, 2004 at 09:36 UTC
    Since any decent database caches previously retrieved data for you, this is more a matter of tinkering with database parameters, e.g. block size, cache size, and index(es). If you build too much caching logic into your program, you are likely to be working against the very sophisticated caching already done by the database system.