PerlMonks  

Archiving data in a database

by SavannahLion (Pilgrim)
on Oct 04, 2004 at 04:42 UTC ( [id://396108] : perlquestion )

SavannahLion has asked for the wisdom of the Perl Monks concerning the following question:

Here's what I have.
I'm creating a new database and the basics are in there. However, one thing I would like to include is a permanent history of changes made to each record. I know that certain databases can maintain such a history themselves, but I have no guarantee that the database software actually running will have this enabled, so I need to implement it myself.

Essentially, the data I want to keep is a smaller subset of the records in question. The largest chunk is a medium blob/text field, plus about a dozen or so related fields.

My question is this... just how should I approach this problem? Should I just do a straightforward update into a separate table and not worry about it, or should I be compressing this data using some sort of compression algorithm such as LZ77?
I expect a lot of updates to this archive database, but I don't expect too many queries.
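Since the archive is write-heavy and rarely queried, compressing the blob/text field before storing it is cheap to try. A minimal sketch using the core Compress::Zlib module (zlib's deflate, which is LZ77-based) — the sample text here is just illustrative:

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# A repetitive medium-text field, typical of record history entries.
my $blob = "The quick brown fox jumps over the lazy dog. " x 200;

my $packed   = compress($blob);     # deflate at the default level
my $restored = uncompress($packed); # round-trip check

printf "original: %d bytes, compressed: %d bytes\n",
    length($blob), length($packed);
die "round-trip failed" unless $restored eq $blob;
```

The compressed bytes go into the archive table's blob column as-is; the cost is only paid again on the (rare) reads that call uncompress.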

If you need any more details, then feel free to ask. :)

Update: I should add a bit of clarification here. I apologize for any confusion and I really appreciate all of the responses from all directions.

I'm creating a permanent history of changes to records on each transaction. Ergo, monthly backups won't be viable unless I capture the changes as well. However, I did neglect to accommodate those monthly backups in case of failure, so I really want to thank pg for mentioning them.

I don't have admin access to the database. The actual database itself is maintained by a third party. I can put in "requests" to change its running state, but I get no guarantees. In fact, I'm still waiting on a response from the admin on exactly how the database is configured so I can attempt to replicate it as closely as possible on the test system. I think he went on vacation six months ago.

There will be no need to replay the entire archive to recreate the database. In most cases, I'll only need to go back two or three records. I can think of only a very few worst-case scenarios where I might need to walk backwards through the archives. Any other use of the archive would be purely for reference.

----
Thanks for your patience.
Prove your knowledge @ HLPD

Replies are listed 'Best First'.
Re: Archiving data in a database
by BrowserUk (Patriarch) on Oct 04, 2004 at 05:57 UTC

    A lot will depend upon which type of database (Berkeley type or RDBMS, etc.) and which particular one of whichever type you are using.

    With a non-relational DB, there are two approaches:

    1. Backup the entire database (optionally with compression) using standard archiving tools like one of the zipping or arcing utilities.

      You would need to do this during a "regular maintenance period" during which the DB is disabled from use. If your DB isn't very big, a few seconds or even minutes at 3.00 am local time (on Sundays?) is probably acceptable for most purposes.

    2. Alternatively, if you need to keep a transactional record of all the changes, then you would (probably) need to roll your own logging around the interface to the DB.

      This isn't too hard to do if you are using a tied hash interface to this type of DB. You only need to write a fairly simple tied hash proxy module that intercepts and forwards each access to the DB and logs those operations (STORE/DELETE etc.) that modify it.
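A minimal sketch of such a proxy, built on the standard Tie::StdHash (here logging to an in-memory array for illustration; a real logger would append to a file or table, and would wrap the real DB tie rather than a plain hash):

```perl
package Tie::LoggedHash;
use strict;
use warnings;
use Tie::Hash;                 # provides Tie::StdHash
our @ISA = ('Tie::StdHash');

# Every mutating operation is recorded here before being forwarded.
our @LOG;

sub STORE {
    my ($self, $key, $value) = @_;
    push @LOG, [ time(), 'STORE', $key, $value ];
    $self->SUPER::STORE($key, $value);
}

sub DELETE {
    my ($self, $key) = @_;
    push @LOG, [ time(), 'DELETE', $key ];
    $self->SUPER::DELETE($key);
}

package main;

tie my %db, 'Tie::LoggedHash';
$db{alpha} = 1;                # logged as STORE
$db{alpha} = 2;                # logged as STORE
delete $db{alpha};             # logged as DELETE

print scalar(@Tie::LoggedHash::LOG), " operations logged\n";  # 3
```

Reads (FETCH/EXISTS) fall through to the parent class untouched, so only modifications pay the logging cost.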

    With an RDBMS, most of these will have facilities for logging record changes built in. It is just a case of reading the docs and learning how to enable them (if they are not so by default) and then how to examine and if necessary, recover information from them.

    For example: MySQL has the --log-update[=file_name] (now deprecated) and the --log-bin[=file_name] (preferred) options that you can add to the mysqld startup command line. See the docs for more info.

    The problem then becomes one of deciding:

    1. how much (how long) you need to keep the log data.

      This is mostly a basic question of "How much disc space am I willing to use for this?". Though, in some circumstances--legal & financial records etc.--the law may dictate how long you must keep such information.

      This is much simpler these days: given the low cost of optical storage media (CDs and DVDs), it is quite easy and very cheap to simply offload your backups on a regular basis and keep them pretty much forever.

    2. what actions you envisage using the archived data for?

      This is one of those questions that you will only be able to answer once you actually have a need for it.

      In theory, if you always have the logging enabled from day one of using a new database, and you keep all the logs, it is possible to "replay" the logs from the beginning, in order, on a restored copy of the bare schema and retrace history back to any point.

      In practice, the longer the database is in use, the longer it will take to get back to a point in the recent past, and the more likely that a log has been lost or corrupted leaving you dead in the water.

      To avoid this, it is best to take regular (incremental) backups of the database itself and archive these as well as the logs. It can save a great deal of time when you need to retrace things to the state a few days ago.
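    The replay idea above can be sketched in a few lines. Assuming a roll-your-own log whose entries are [timestamp, operation, key, value] tuples (a hypothetical format, matching what a tied-hash logger might produce), reconstructing the state at any past moment is just a fold over the entries up to that cut-off:

```perl
use strict;
use warnings;

# Hypothetical log entries, oldest first: [ timestamp, op, key, value ]
my @log = (
    [ 100, 'STORE',  'name', 'v1' ],
    [ 200, 'STORE',  'name', 'v2' ],
    [ 300, 'DELETE', 'name'       ],
    [ 400, 'STORE',  'name', 'v3' ],
);

# Replay everything up to (and including) the cut-off time onto a
# bare state -- the "restored copy of the bare schema" in miniature.
sub state_at {
    my ($cutoff, @entries) = @_;
    my %state;
    for my $e (@entries) {
        my ($ts, $op, $key, $value) = @$e;
        last if $ts > $cutoff;           # entries are in time order
        if    ($op eq 'STORE')  { $state{$key} = $value }
        elsif ($op eq 'DELETE') { delete $state{$key} }
    }
    return %state;
}

my %then = state_at(250, @log);
print $then{name}, "\n";   # v2
```

    Starting the fold from the most recent full backup instead of an empty state is exactly the incremental-backup shortcut described above.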

    Finally, the most important two things about backups (DB or otherwise) are

    • That the mechanisms for doing them are in place, simple (read: automated), and easily understood by the new guy a year from now.

      E.g. it's no good having a really sophisticated backup naming scheme if, a year from now when the guy who invented it has moved on and you really need to recover the transactions from a month ago last Thursday, whoever is charged with the task can't work out how to do it.

    • The consistency (and usability) of the backups should be tested on a regular basis. And that means testing recovering a real piece of data, from a real, live production system--not the test setup!

      I won't recount the whole story, but I was working for a large UK retailer that was spending large sums of money on tape rotation and off-site storage of multi-generational backup tapes for each of its 300+ stores. The day came that disaster struck, it was necessary to replace a store's hardware and recover the DB from tape--and the tapes were blank. All of them. A minuscule difference meant that whilst the rigorously tested system worked just fine in the test store environment, it wrote completely useless tapes in the production environment. A sum of 5 million wasted over 18 months was mentioned.

    Tests, test harnesses and test environments are fine things to have, but cross verification in a live environment is paramount.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Archiving data in a database
by cleverett (Friar) on Oct 04, 2004 at 05:53 UTC
    The simplest way I can think of:
    1. use 2 tables (assuming mysql here, adjust as necessary):
      create table d_table (
          d_table_id integer unsigned auto_increment primary key,
          table_name varchar(32) not null,
          d_happened timestamp not null,
          operation enum('insert','update','delete') not null,
          index (table_name, d_happened)
      );
      create table d_field (
          d_field_id integer unsigned auto_increment primary key,
          d_table_id integer unsigned not null,
          field_name varchar(32) not null,
          old_value text not null,
          unique index (d_table_id, field_name)
      );
    2. For table inserts, simply note the insert:
      $dbh->do("INSERT INTO d_table SET table_name = ?, operation = 'insert'",
          undef, $table_name);
    3. For table updates and deletes, run something like:
      my $old_data = $dbh->selectrow_hashref(
          "select * from table_to_change where primary_key_id = ?",
          undef, $table_row_id);
      ## update the table
      $dbh->do("INSERT INTO d_table SET table_name = ?, operation = 'update'",
          undef, $table_name);
      my $d_table_id = $dbh->selectrow_array("select last_insert_id()");
      foreach my $f (@updated_field_names) {
          $dbh->do("INSERT INTO d_field (d_table_id, field_name, old_value)
                    VALUES (?, ?, ?)",
              undef, $d_table_id, $f, $old_data->{$f});
      }
      I leave the changes needed for a delete as an exercise to the dear reader :)
    Using this general scheme you should find it fairly easy to recapture the state of your database at any time during its existence.
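    Recapturing an earlier state with this scheme means walking the d_table/d_field history newest-to-oldest and re-applying each change's old values. A pure-Perl sketch with in-memory structures standing in for the two tables (the row and field names are made up for illustration):

```perl
use strict;
use warnings;

# The current row, plus d_table/d_field-style history entries, oldest
# first. Each entry records the *old* values of the fields it changed.
my %current = ( title => 'Third draft', status => 'final' );
my @history = (
    { d_table_id => 1, fields => { title => 'First draft', status => 'open' } },
    { d_table_id => 2, fields => { title => 'Second draft' } },
    { d_table_id => 3, fields => { status => 'review' } },
);

# Rewind the row to just before a given d_table_id by applying old
# values from the newest entry backwards.
sub rewind_to {
    my ($target_id, $row, @entries) = @_;
    my %row = %$row;
    for my $h (reverse @entries) {
        last if $h->{d_table_id} < $target_id;
        %row = (%row, %{ $h->{fields} });
    }
    return %row;
}

my %before = rewind_to(2, \%current, @history);
print "$before{title} / $before{status}\n";   # Second draft / review
```

The real version would fetch the entries with two selects against d_table and d_field; the fold is the same.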

    Notes:

    • Though I believe in the general soundness of the above, a bunch of situations exist that could thoroughly ream you, unless you extend the model to cover them:
      • a mysql "ALTER TABLE foo AUTO_INCREMENT = 0", which will reset your row numbers (you can handle this by tracking row numbers for your operations though).
      • Global operations will cost you bigtime, but again you can extend this model to turn them into single operations by using an enum field as a flag.
    • In your shoes, I would be using $dbh->begin_work(), $dbh->commit() and $dbh->rollback() from the Perl DBI library to keep the d_table and d_field tables synced with the rest of your database.
    • If you have to do extensive coding with this discipline, create a subclass of the Class::DBI library to automate the changes to the d_table and d_field tables, instead of coding everything by hand.
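    The begin_work/commit/rollback discipline mentioned above is easiest to get right if it lives in one helper. A sketch of such a wrapper, demonstrated against a tiny mock handle so it runs without a live database (Mock::DBH is invented for the demo; a real DBI handle has the same three methods):

```perl
use strict;
use warnings;

# Run $work inside a transaction on a DBI-style handle: commit on
# success, roll back and re-throw on failure.
sub in_transaction {
    my ($dbh, $work) = @_;
    $dbh->begin_work;
    my $ok = eval { $work->(); 1 };
    if ($ok) { $dbh->commit }
    else     { my $err = $@; $dbh->rollback; die $err }
    return $ok;
}

# Minimal stand-in handle that just records which methods were called.
package Mock::DBH;
sub new        { bless { calls => [] }, shift }
sub begin_work { push @{ $_[0]{calls} }, 'begin' }
sub commit     { push @{ $_[0]{calls} }, 'commit' }
sub rollback   { push @{ $_[0]{calls} }, 'rollback' }

package main;
my $dbh = Mock::DBH->new;
eval { in_transaction($dbh, sub { die "update failed\n" }) };
print join(',', @{ $dbh->{calls} }), "\n";   # begin,rollback
```

    With this in place, the table update and the matching d_table/d_field inserts go into one coderef, so they commit or vanish together.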
Re: Archiving data in a database
by pg (Canon) on Oct 04, 2004 at 04:53 UTC

    I use two approaches at the same time. (Whether this fits you, really depends on your purpose.)

    • I have a data warehouse, so "facts" are extracted from the operational database periodically and stored in the data warehouse. This is one type of history.
    • On the other hand, I also want to know how those "facts" are changing in my operational database. I usually just design some event tables and record what changes (including insert/change/delete) have been made, when, and by which user. How detailed? Detailed enough that I can figure out what has happened when a user questions the correctness of the data.
Re: Archiving data in a database
by astroboy (Chaplain) on Oct 04, 2004 at 05:03 UTC
    I've done something like this in an Oracle database trigger, but the same approach can be done in Perl if you want to be database agnostic. I set up an audit table with columns like table_name, primary_key (assumes a single column key, but could be modified for compound keys), value_before, value_after, user_name and date_changed. Every time a DML statement is performed, a row is inserted into the audit table for each column affected by that statement. Naturally, inserts will have no before values, and deletes will have no after values.
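    The database-agnostic Perl side of this is just a diff of the row before and after the DML. A sketch that builds one audit row per changed column (the column names and row layout follow the description above but are illustrative, not a real schema):

```perl
use strict;
use warnings;

# Build audit rows (one per changed column) from before/after copies
# of a record. Inserts pass an empty %$before; deletes an empty %$after.
sub audit_rows {
    my ($table, $pk, $user, $before, $after) = @_;
    my @rows;
    my %cols = map { $_ => 1 } (keys %$before, keys %$after);
    for my $col (sort keys %cols) {
        my ($old, $new) = ($before->{$col}, $after->{$col});
        next if  defined $old &&  defined $new && $old eq $new;
        next if !defined $old && !defined $new;
        push @rows, {
            table_name   => $table,
            primary_key  => $pk,
            column_name  => $col,
            value_before => $old,      # undef for an insert
            value_after  => $new,      # undef for a delete
            user_name    => $user,
            date_changed => time(),
        };
    }
    return @rows;
}

my @rows = audit_rows('employee', 42, 'alice',
    { salary => 100, dept => 'ops' },
    { salary => 120, dept => 'ops' });
print scalar(@rows), " column(s) changed: $rows[0]{column_name}\n";
```

    Each returned hashref maps straight onto an insert into the audit table.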

      Better to do it with a trigger, so you only code it once. Otherwise the code will be repeated everywhere in your accessing programs and be much more difficult to maintain. Even if you capture the logic in a library, a programmer still might forget to call it when he should.

      Also, a trigger is usually faster.

        The only problem is that if you are using MySQL, you won't have triggers until version 5.

        As a side note, I recently had to change an application over to using MySQL instead of Sybase, and I certainly wished the original developer had not bothered with triggers. :-)
Re: Archiving data in a database
by TedPride (Priest) on Oct 04, 2004 at 05:05 UTC
    pg is correct++. The best way to do this is make a backup copy of the database every so often (monthly usually), and record changes since the last backup in a secondary event file, with timestamp and user (and/or IP) fields to show who made the changes. If anyone questions your data, you just go back to the last log showing (according to the person questioning) accurate data, and print out the events for that record or records.