database strategy

by aufrank (Pilgrim)
on Aug 02, 2002 at 23:24 UTC ( [id://187248] : perlquestion )

aufrank has asked for the wisdom of the Perl Monks concerning the following question:

hey all--

I'm working on some more DBI stuff and have kind of hit a wall. The stopping point is not Perl, though; it's the way the databases I'm working with were set up. A brief explanation:

  • four fields: id, last_name, first_name, address
  • id is guaranteed to be unique, not guaranteed to be defined
  • last_name and first_name are guaranteed to be defined, but not to be unique
  • address is probably defined and is probably unique
    the problem is that I have many different tables, and need to try to bring the information together, but cannot afford to match the wrong information with the wrong person. if I use the id field to identify and compare records across tables, I'll get correct matches but will not process all the records. if I use some combination of names and address, I risk incorrectly matching records where the names are the same and the addresses aren't defined.

    the first reply below is my implementation of a system that matches entries from different tables by id if possible, and if not, by names and address. (thought I should put it in a reply to keep it from cluttering the SOPW page).
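    The matching rule described above (id first, then names plus address) can be sketched roughly as follows. This is not the code from the reply; it is a minimal illustration in Python, used here only to keep the example self-contained. The helper names `norm` and `find_match` and the sample rows are hypothetical, and the sketch assumes that an empty string counts as "not defined".

```python
from typing import Optional


def norm(value) -> Optional[str]:
    """Treat None, empty, and whitespace-only values as undefined."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return text or None


def find_match(record: dict, candidates: list) -> Optional[dict]:
    """Return the one candidate that safely matches record, else None."""
    rec_id = norm(record.get("id"))
    if rec_id is not None:
        by_id = [c for c in candidates if norm(c.get("id")) == rec_id]
        if len(by_id) == 1:
            return by_id[0]      # id is unique, so this is a safe match
        if len(by_id) > 1:
            return None          # the uniqueness guarantee is broken; do not guess

    # Fallback: names AND a defined address must all agree.
    rec_addr = norm(record.get("address"))
    if rec_addr is None:
        return None              # names alone are not unique
    rec_names = (norm(record.get("last_name")), norm(record.get("first_name")))
    fallback = [
        c for c in candidates
        # If both sides have an id, they already failed to match on it above.
        if (rec_id is None or norm(c.get("id")) is None)
        and (norm(c.get("last_name")), norm(c.get("first_name"))) == rec_names
        and norm(c.get("address")) == rec_addr
    ]
    return fallback[0] if len(fallback) == 1 else None


# Hypothetical rows from a second table.
table_b = [
    {"id": "17", "last_name": "Smith", "first_name": "Ann", "address": "1 Elm St"},
    {"id": None, "last_name": "Smith", "first_name": "Ann", "address": None},
    {"id": None, "last_name": "Jones", "first_name": "Bob", "address": "9 Oak Rd"},
]

hit_by_id = find_match(
    {"id": "17", "last_name": "Smith", "first_name": "A.", "address": None}, table_b)
hit_by_address = find_match(
    {"id": None, "last_name": "Jones", "first_name": "Bob", "address": "9 oak rd"}, table_b)
no_hit = find_match(
    {"id": None, "last_name": "Smith", "first_name": "Ann", "address": None}, table_b)
```

    The point of returning None instead of a best guess is that unmatched records can be set aside for manual review, which fits the requirement of never pairing the wrong information with the wrong person.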

    I guess I have three questions: 1) is there a better general strategy to deal with the problem? 2) is the code I have included an effective way of implementing the solution I've suggested? and 3) what suggestions should I make to the people who actually do the db design that might keep this sort of thing from happening with future tables?

    if you feel that the real issue is simply a failure on my part to comprehend something basic and important, please include a link to someplace I can read up on it-- I'm fully aware of just how ignorant I may very well be, but not quite sure what I'm ignorant of! :D

    please do take a look at the code below,

    Replies are listed 'Best First'.
    Re: database strategy
    by Zaxo (Archbishop) on Aug 03, 2002 at 01:33 UTC

      For database design in general, search Google for "database normalization"; that turns up plenty of tutorial sites.

      You said on cb that the table you're talking about was constructed without a primary key (a column of unique non-null values). The id column should be it. Realize that this design error may not have a complete solution, and in fact you should hope that the rest of the tables are badly normalized. That will help you, because replicated data may give enough clues to reassociate the data with the correct person.

      Your strategy for doing that looks reasonable, but I would extract the known good data first, making new tables of everything that has a sane id and of all the other tables' records that are associated with uncorrupted accounts. After deleting the good records from the old tables, start trying to match the corrupt accounts with the leftovers in the other tables. Ultimately, you may need to contact some people directly and get them to identify transactions. That will be embarrassing (one hopes to the designers of that mess).
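      A rough sketch of the "extract the known good data first" step, assuming SQLite through Python's sqlite3 module purely so the example runs on its own; the SQL should carry over to DBI with little change. The table names `people_old` and `people_good` and the sample rows are hypothetical, and "sane id" is assumed to mean neither NULL nor blank. Only the account table is shown; the associated records in other tables would be moved the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for one of the badly designed tables.
    CREATE TABLE people_old (id TEXT, last_name TEXT, first_name TEXT, address TEXT);
    INSERT INTO people_old VALUES
        ('17', 'Smith', 'Ann', '1 Elm St'),
        (NULL, 'Smith', 'Ann', NULL),
        ('',   'Jones', 'Bob', '9 Oak Rd'),
        ('42', 'Brown', 'Cy',  NULL);

    -- Step 1: copy every record with a sane id into a new table with a real key.
    CREATE TABLE people_good (
        id         TEXT PRIMARY KEY NOT NULL,
        last_name  TEXT NOT NULL,
        first_name TEXT NOT NULL,
        address    TEXT
    );
    INSERT INTO people_good
        SELECT id, last_name, first_name, address
        FROM people_old
        WHERE id IS NOT NULL AND TRIM(id) <> '';

    -- Step 2: delete exactly what was copied, leaving only the leftovers.
    DELETE FROM people_old WHERE id IN (SELECT id FROM people_good);
""")

good = conn.execute("SELECT id FROM people_good ORDER BY id").fetchall()
leftovers = conn.execute(
    "SELECT last_name, first_name FROM people_old ORDER BY last_name"
).fetchall()
```

      After this, `people_old` holds only the records that need the riskier name-and-address matching, so mistakes there cannot touch the rows that were already safe.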

      Your new tables should each have a primary key, and that key should be the only thing used to refer to a record in another table. Consider autoincrement fields for primary keys. Make sure all the other tables have good primary keys. Do not use timestamps for that.
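      One way the new tables might look, again sketched with SQLite via Python's sqlite3 as an assumption of convenience. The `person` and `payment` tables and their columns are hypothetical. The sketch shows an autoincrement primary key on each table and the key being the only thing another table uses to refer to a person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (
        person_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        last_name  TEXT NOT NULL,
        first_name TEXT NOT NULL
    );
    CREATE TABLE payment (
        payment_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        person_id    INTEGER NOT NULL REFERENCES person(person_id),
        amount_cents INTEGER NOT NULL
    );
""")

insert_person = "INSERT INTO person (last_name, first_name) VALUES (?, ?)"
# Two people with identical names still get distinct keys.
ann_one = conn.execute(insert_person, ("Smith", "Ann")).lastrowid
ann_two = conn.execute(insert_person, ("Smith", "Ann")).lastrowid

insert_payment = "INSERT INTO payment (person_id, amount_cents) VALUES (?, ?)"
conn.execute(insert_payment, (ann_one, 1250))

# A payment pointing at a person who does not exist is refused.
try:
    conn.execute(insert_payment, (999, 100))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

      Other databases spell the autoincrement column differently, so the exact DDL here should be treated as SQLite-specific.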

      After Compline,

    Re: database strategy
    by Ryszard (Priest) on Aug 03, 2002 at 16:31 UTC
      To expand on Zaxo's point a little, if (when you build a new set of tables) you cannot easily define a "natural" primary key, it is perfectly acceptable to create an "artificial" key. The best method of doing this is with a "sequence" in your database.

      A hint with normalisation: If you have more than one person at an address, you may create two tables:

      1. Names
      2. Addresses
      You may then include an "address_id" column (the primary key from Addresses) in your "Names" table. If more than one person lives at an address, each of their rows holds the address_id rather than the entire address detail.

      This also has the advantage of flexibility. For example, if for some reason the address changes from street to road, you only have to update it in one spot, not two!

      If you wanted to get right into normalisation, the basic idea is to not replicate the same data in more than one spot. For example, you may create a table that has all the different types of streets (street, road, place etc etc), then add that primary key into the address table.
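      The Names/Addresses split can be sketched like this, assuming SQLite through Python's sqlite3 so that it runs as-is. The column names other than address_id, and the sample data, are made up for illustration. Two people share one address row, and the street-to-road change is made in exactly one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (
        address_id INTEGER PRIMARY KEY,
        street     TEXT NOT NULL
    );
    CREATE TABLE names (
        name_id    INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL,
        first_name TEXT NOT NULL,
        address_id INTEGER REFERENCES addresses(address_id)
    );

    INSERT INTO addresses (address_id, street) VALUES (1, '12 Main Street');
    -- Both people point at the same address row instead of repeating it.
    INSERT INTO names (last_name, first_name, address_id) VALUES
        ('Smith', 'Ann', 1),
        ('Smith', 'Bob', 1);

    -- The address changes from street to road: one update, one spot.
    UPDATE addresses SET street = '12 Main Road' WHERE address_id = 1;
""")

rows = conn.execute("""
    SELECT n.first_name, a.street
    FROM names n
    JOIN addresses a ON a.address_id = n.address_id
    ORDER BY n.first_name
""").fetchall()
```

      Both joined rows pick up the new street text, because neither row in `names` ever stored it.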

    Re: database strategy
    by Cine (Friar) on Aug 03, 2002 at 01:12 UTC
      I would suggest that you do it via two SQL queries:

        SELECT * FROM table a, table b
        WHERE a.id = b.id AND a.id IS NOT NULL AND a.id != '';

      and

        SELECT * FROM table a, table b
        WHERE a.first_name = b.first_name AND a.last_name = b.last_name
          AND (a.id IS NULL OR a.id = '') AND (b.id IS NULL OR b.id = '');

      T I M T O W T D I
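      A runnable sketch of the two-query idea, assuming SQLite through Python's sqlite3. The table names `table_a` and `table_b`, the sample rows, and the choice of id as the column tested in each WHERE clause are assumptions made for illustration. The first query joins on id where it is defined; the second joins on names where the id is missing on both sides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id TEXT, last_name TEXT, first_name TEXT, address TEXT);
    CREATE TABLE table_b (id TEXT, last_name TEXT, first_name TEXT, address TEXT);
    INSERT INTO table_a VALUES
        ('17', 'Smith', 'Ann', '1 Elm St'),
        (NULL, 'Jones', 'Bob', '9 Oak Rd'),
        ('',   'Brown', 'Cy',  NULL);
    INSERT INTO table_b VALUES
        ('17', 'Smith', 'Ann', NULL),
        ('',   'Jones', 'Bob', '9 Oak Rd'),
        ('23', 'Brown', 'Cy',  '4 Ash Ln');
""")

# Query 1: pair rows whose ids are defined and equal.
by_id = conn.execute("""
    SELECT a.id, a.last_name
    FROM table_a a, table_b b
    WHERE a.id = b.id AND a.id IS NOT NULL AND a.id != ''
""").fetchall()

# Query 2: pair rows by name, but only when neither side has an id.
by_name = conn.execute("""
    SELECT a.first_name, a.last_name
    FROM table_a a, table_b b
    WHERE a.first_name = b.first_name AND a.last_name = b.last_name
      AND (a.id IS NULL OR a.id = '')
      AND (b.id IS NULL OR b.id = '')
""").fetchall()
```

      Note that the name-only join will still pair two different people who happen to share a name, so adding a condition such as `a.address = b.address` to the second query may be worth considering when a wrong match is costly.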
    Re: database strategy
    by aufrank (Pilgrim) on Aug 02, 2002 at 23:28 UTC