
RDB updates: rebuild from scratch vs. incremental

by tlm (Prior)
on Aug 30, 2005 at 13:07 UTC ( [id://487740] )

tlm has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'm in the early stages of designing a system for collecting data from various disparate sources, processing it, and storing the results in an RDBMS (most likely PostgreSQL, but possibly MySQL). I'm very new to the brave new world of RDBs; I'd appreciate your comments on the following.

The primary sources for this system (which are entirely outside our control) are updated frequently, so we want to have regular (monthly?) updates for our derived database.

I'm debating between two different approaches to this updating. One is to re-build the database from scratch once a month. The other is to do the updates incrementally.

For the first approach, which I find by far the simplest, I was thinking of having a Perl script build an intermediate SQL file, analogous to a DB dump, and then having Pg read that in (I've found that this two-stage method is much faster than building the database directly from a Perl script via DBI+DBD::Pg). The one concern I have about this is that most of my tables have serial fields that have no intrinsic meaning, and whose sole purpose is to relate tables to each other. Re-building from scratch every month would replace all these fields with new values. In other words, the values of these fields would not be meaningful across builds. Would this be a problem down the line?
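For concreteness, here's a rough sketch of what I mean by the two-stage approach: a Perl script writes a COPY-based SQL file, and psql then loads it in one pass. The table name, columns, and file names are just made up for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical table and rows; in reality these come from the
    # processed primary sources.
    my @genes = ( [ 1, 'abc1' ], [ 2, 'def2' ] );

    open my $out, '>', 'rebuild.sql' or die "rebuild.sql: $!";
    print {$out} "BEGIN;\n";
    print {$out} "TRUNCATE gene;\n";
    print {$out} "COPY gene (gene_id, symbol) FROM stdin;\n";
    print {$out} join( "\t", @$_ ), "\n" for @genes;
    print {$out} "\\.\n";
    print {$out} "COMMIT;\n";
    close $out;

    # Load it in one shot:  psql -q -f rebuild.sql mydb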

The second approach, incremental updates, I find formidable, to be honest. It sounds reasonable at first blush; after all, why rebuild an entire database when only 1% of it has changed since the last update? This reasoning would be on target if the updates were always additions, but that's not the case: updates will also include deletions and modifications. So an incremental update would have to compare each record in the old version with each record in the new one, and somehow detect when records have been deleted. I suppose there may be situations in which this approach, as painstaking as it sounds, is faster than a clean rebuild, but whatever they are, I can't see them on the horizon. Moreover, I don't know how to do this other than through a Perl script, which I expect to be much slower than the two-stage clean rebuild described above. But again, I'm very green at this, so please feel free to bonk me with the clue-by-four.

I realize my questions here are very vague. That's another symptom of my cluelessness on this.

the lowliest monk


Re: RDB updates: rebuild from scratch vs. incremental
by danmcb (Monk) on Aug 30, 2005 at 13:37 UTC

    Definitely incremental, especially if the events that cause you to update your data take place incrementally. Doing the update as events arrive distributes the processing workload over time, and forces you to design a system with some robustness.

    Think about what those events are, how they are signalled to your system, and what the change to your data is for each kind of event. Work this out with pencil and paper as far as you can. Work out what it is you don't know (not so much about the technical side, more about the proper behaviour of the system).

    What inconsistent states can arise as a result of unexpected events? How will you handle them?

    What are the invariants of the state of your system?

    Design database queries that update your db for expected events. If updates affect several tables (i.e. to preserve the links between tables that you mention, which by the way are generally a GOOD thing), find out how to lock tables so that the data accessed by users will always be consistent.
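    As a minimal sketch of that locking idea, assuming PostgreSQL and DBI (the table names are invented, and the right lock mode depends on your workload):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );

        # Apply a multi-table change atomically: readers see either the
        # old state or the new one, never a half-applied update.
        eval {
            $dbh->do('LOCK TABLE gene, gene_alias IN SHARE ROW EXCLUSIVE MODE');
            $dbh->do('DELETE FROM gene_alias WHERE gene_id = ?', undef, 42);
            $dbh->do('DELETE FROM gene       WHERE gene_id = ?', undef, 42);
            $dbh->commit;
        };
        if ($@) {
            $dbh->rollback;
            die "update failed, rolled back: $@";
        }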

    DO take the time to make sure that your database design is good and correctly normalised. Not doing this will cost you many, many hours down the line. If you don't know what this means, check out some of the many tutorials on the web.

Re: RDB updates: rebuild from scratch vs. incremental
by jfroebe (Parson) on Aug 30, 2005 at 13:20 UTC

    Hi,

    Performing incremental changes of your data is generally preferred over rebuilding the database every month. During the recreation of your database, your database would be unavailable for your users.

    There are several good books on both Postgres and MySQL... my recommendation is to take a weekend and read one of them. It would also be worth picking up a book on database design, or else you will likely find yourself wanting to redesign the database later on.

    hope this helps :)

    Jason L. Froebe

    Team Sybase member

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

      During the recreation of your database, your database would be unavailable for your users.

      Depending on how you rebuild your data, that may not be true.

      I reload a number of tables every morning, and the process takes about 25 minutes for me to acquire the data, and do the pre-processing for the application.

      For each table, as I am processing it, I first remove tables named backup_$name. I next load the data into temp_$name, and perform whatever processing needs to be done, such as sanitizing the data, generating derived values, and indexing. Once all of the tables are loaded, I loop through the list of tables and rename $name to backup_$name and rename temp_$name to $name.
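      In Perl/DBI terms, a rough sketch of that swap step might look like the following (the connection and table list are invented, and the backup drop is folded into the swap for brevity; only the temp_/backup_ naming comes from the description above):

          use strict;
          use warnings;
          use DBI;

          my $dbh    = DBI->connect( 'dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 } );
          my @tables = qw( gene gene_alias );   # hypothetical list

          # In PostgreSQL, DDL is transactional, so the renames are atomic:
          # readers see either all old tables or all new ones.
          $dbh->begin_work;
          for my $name (@tables) {
              $dbh->do("DROP TABLE IF EXISTS backup_$name");
              $dbh->do("ALTER TABLE $name RENAME TO backup_$name");
              $dbh->do("ALTER TABLE temp_$name RENAME TO $name");
          }
          $dbh->commit;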

      Although there's added load on both databases every morning (the one I'm exporting from and the one I'm importing to), both remain available throughout the process. (I've yet to have anyone complain about issues while the tables are getting renamed, but there is that possibility ... it's still significantly less downtime than recreating the database would take.)

      In my situation, I have the luxury of disk space -- it's possible that someone with a larger database may not have that. (I keep the backups not just for the ability to roll back, but also to make sure that other folks on the machine don't try to use the space that I'm going to need for the reprocessing)

      I'll go back to one of my overused mantras: there is no one best solution for every scenario. There are ways of doing this with either incremental updates or a complete rebuild. Without knowing more about the nature of the data, I don't think it'd be good to recommend either one, because this data may be completely different from everything I've ever worked with.

        I agree :) How long the tables would be inaccessible to users also depends on the chosen DBMS. In your case, it is probably less than a second. On lightly utilized servers that is probably more than acceptable... in other scenarios, the loss of half a second could amount to thousands of dollars (trading systems, for example).

        Definitely, the possible solutions will vary :)

        Jason L. Froebe

        Team Sybase member

        No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

Re: RDB updates: rebuild from scratch vs. incremental
by diotalevi (Canon) on Aug 30, 2005 at 13:17 UTC
    Here's something even easier. Do your difference detection *outside* of Pg. Save the previous version of your data and compare it with the latest using something apropos like diff(1) or Algorithm::Diff. You can use that to make just the specific changes required to keep your database synched up.
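    A minimal sketch of that idea with Algorithm::Diff, assuming one sorted, one-row-per-line dump file per month (the file names and row format are invented):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Algorithm::Diff qw(diff);

        sub read_rows {                       # one row per line, sorted
            my ($path) = @_;
            open my $fh, '<', $path or die "$path: $!";
            chomp( my @rows = <$fh> );
            return \@rows;
        }

        my $old = read_rows('dump.previous');
        my $new = read_rows('dump.current');

        for my $hunk ( diff( $old, $new ) ) {
            for my $change (@$hunk) {
                my ( $op, $pos, $row ) = @$change;
                if ( $op eq '-' ) {
                    print "DELETE: $row\n";   # row vanished since last time
                }
                else {
                    print "INSERT: $row\n";   # row is new
                }
            }
        }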
Re: RDB updates: rebuild from scratch vs. incremental
by anonymized user 468275 (Curate) on Aug 30, 2005 at 15:01 UTC
    Although the updates should be performed incrementally, they need to be calculated in bulk. The approach I usually take to this is as follows:

    Process the files from the other sources into a load format that suits the DBMS you choose. You might have a directory for this named by the download date (most operations go for daily uploads from other sources rather than monthly, both to avoid the delay of processing a big backlog and to stay up to date).

    There should therefore be one file per table, with the PK columns in their proper order, sorted for the next step:

    To calculate an incremental load, use the Unix comm(1) command, with option -23 for deletes and -13 for inserts (updates will then be represented as a delete followed by an insert), e.g. for the deletes:

    for ( glob 'path/YYYYMMDD/*' ) {
        my @split = split '/', $_;
        my $file  = pop @split;
        open my $comm, "comm -23 $_ path/$prv_bus_day/$file |"
            or die "comm failed: $!";
        for my $delrow ( <$comm> ) {
            # build the delete statement from $delrow
        }
        close $comm;
    }
    To build the delete statements, you will need to query the database's data dictionary to get the table's PK, which then has to be matched against the columns retrieved from the above pipe.
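    For example, against PostgreSQL the PK columns can be pulled from information_schema via DBI (the table name here is hypothetical):

        use DBI;

        my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 } );

        # Primary-key column names of one table, in key order.
        my $sql = q{
            SELECT kcu.column_name
              FROM information_schema.table_constraints tc
              JOIN information_schema.key_column_usage  kcu
                ON tc.constraint_name = kcu.constraint_name
               AND tc.table_schema    = kcu.table_schema
             WHERE tc.table_name      = ?
               AND tc.constraint_type = 'PRIMARY KEY'
             ORDER BY kcu.ordinal_position
        };
        my $pk_cols = $dbh->selectcol_arrayref( $sql, undef, 'gene' );

        # Each $delrow then matches @$pk_cols positionally when building
        # "DELETE FROM gene WHERE gene_id = ? AND ..." statements.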

    One world, one people

Re: RDB updates: rebuild from scratch vs. incremental
by polettix (Vicar) on Aug 30, 2005 at 15:22 UTC
    In other words, the values of these fields would not be meaningful across builds. Would this be a problem down the line?
    I fear so. Unless you're doing a simple store-and-peek system, any derived tables that refer to the original data would be lost at each rebuild (or you'd be forced to rebuild them as well).

    Of course, this is in addition to the considerations above, in particular the fact that the first approach would interrupt service. The whole point of the RDBMS should be that you can work on the data with the least interference with what other users are doing.

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
