Database normalization the easier way

Most distinguished Monks,
A recent node (thanks, jeffa) offered a very interesting theoretical (and practical) view into database normalization.
While I couldn't agree more on the benefits of normalization, I was wondering if there was a different way of tackling the problem from the practical side.
Of course, normalization is never going to be "easy." But it could be easier than building a specialized solution for each table.
Due to previous experience with these tasks (which I used to solve in C - don't shoot!) I felt that a more general solution is possible, and fiddling around with Perl I came up with a module that can reduce our normalization efforts to something like the following:

#!/usr/bin/perl -w
use strict;

use Normalizer;
my %parameters = (
   DSN           => "DBI:mysql:music;host=localhost;"
                    . "mysql_read_default_file=$ENV{HOME}/.my.cnf", 
   src_table     => "MP3",
   index_field   => "album_id",
   lookup_fields => "artist,album,genre",
   lookup_table  => "tmp_albums", 
   dest_table    => "songs",
   copy_indexes  => 1
);
my $norm = Normalizer->new (\%parameters);
$norm->do();
[download]

The more adventurous could also try a one-liner (Normalization Golf?):

perl -e 'use Normalizer; Normalizer->snew(qw(localhost music \ 
    MP3 album_id album,artist,genre tmp_albums songs 1 0 0))->do()'
[download]

What happens with this script?
Having this initial data (database: music, table: MP3),

 +----------+-------------+------+-----+----------+----------------+
 | Field    | Type        | Null | Key | Default  | Extra          |
 +----------+-------------+------+-----+----------+----------------+
 | ID       | int(11)     |      | PRI | NULL     | auto_increment |
 | title    | varchar(40) |      | MUL |          |                |
 | artist   | varchar(20) |      | MUL |          |                |
 | album    | varchar(30) |      | MUL |          |                |
 | duration | time        |      |     | 00:00:00 |                |
 | size     | int(11)     |      |     | 0        |                |
 | genre    | varchar(10) |      | MUL |          |                |
 +----------+-------------+------+-----+----------+----------------+
[download]

Here are the instructions produced by the above lines of perl:

    DROP TABLE IF EXISTS tmp_albums;
      # create the lookup table
    CREATE TABLE tmp_albums 
        (album_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, 
        artist varchar(20) not null,
        album varchar(30) not null,
        genre varchar(10) not null, 
    KEY artist (artist), 
    KEY album (album), 
    KEY genre (genre));

    # populate the lookup table
      INSERT INTO tmp_albums 
    SELECT DISTINCT NULL, artist,album,genre FROM MP3;

    DROP TABLE IF EXISTS songs;
    # create the destination table
      CREATE TABLE songs 
        (ID int(11) not null auto_increment,
        title varchar(40) not null,
        duration time not null default '00:00:00',
        size int(11) not null, 
        album_id INT(11) NOT NULL, 
    PRIMARY KEY (ID), 
    KEY title (title), 
    KEY album_id (album_id));
 
      # Here is the trick! Using the lookup fields
      # as foreign keys, we populate the destination
      # table from source_table JOINed to lookup_table 
    INSERT INTO songs SELECT src.ID, 
    src.title, src.duration, src.size, album_id 
    FROM MP3 src INNER JOIN tmp_albums lkp 
        ON (src.artist =lkp.artist and src.album =lkp.album 
            and src.genre =lkp.genre);
[download]

Don't rush to the CPAN ;-). The Normalizer module is not there (yet) also because I don't know if I should ask for a standalone namespace or under DBIx:: (any piece of advice here will be more than welcome). It is not a short piece of code either. It is 968 lines (65% of which are documentation) and I didn't feel like posting all of it in this node.
But you are welcome to have a look at it from the following addresses:
Normalizer.pm (highlighted syntax - 128 KB)
Normalizer.pm (plain script - 38 KB)
Normalizer.pod (documentation - 30 KB)
Normalizer-0.05.tgz (complete package - 33 KB)

update 2-Feb-2002
Now in the CPAN as DBSchema-Normalizer

The black magic behind this code is more SQL than Perl. However, Perl makes it easier to collect the necessary pieces of information from the database engine and create the SQL statements. More important, Perl makes a generalized solution feasible.
A complete explanation of the algorithm is in the module documentation. The basic concept is to let the database engine work the heavy load, while Perl is directing the operations without wasting any valuable resources.
This module deals only with MySQL databases, but the principle should be valid for any RDBMS. If you want to try it risk-free, it is possible to run the script in "simulation mode," producing the SQL without executing it.
I hope this is going to be helpful, and I will be glad to receive your comments.

 _  _ _  _  
(_|| | |(_|><
 _|

Comment on Database normalization the easier way Select or Download Code

Replies are listed 'Best First'.
Re: Database normalization the easier way by IlyaM (Parson) on Dec 17, 2001 at 17:39 UTC
I would not use standalone namespace for module which basically does just one thing. IMHO your module belongs to DBIx:: It would be cool if it were not used MySQL specific code. There exist some modules which allow to do database schema operations in more portable way: Alzabo, DBIx::DBSchema. Anyway idea is very intersting. It is sad I have no votes left :(. -- Ilya Martynov (http://martynov.org/)	[reply]
Re: Re: Database normalization the easier way by gmax (Abbot) on Dec 18, 2001 at 18:54 UTC
My initial idea was to ask for a DBI:: namespace, but yesterday I found out that such namespace is restricted. Then DBIx:: is a good candidate (Or it should be as soon as I get an answer from CPAN. Being in a monastery, I should become used to being patient {grin}). DBIx::DBSchema was a good hint. I checked it and it seems to provide enough information to replace the direct calls I am using so far. I would need to rewrite the two subs that are dealing with column information (so many greps and maps wasted!) but this is a good chance to make the module portable across databases. Thanks. gmax	[reply]
Re: Database normalization (in CPAN!) by gmax (Abbot) on Feb 02, 2002 at 15:31 UTC
After some discussion in the `modules@perl.org` list, I accepted Tim Bunce's suggestion to put it under a new namespace, DBSchema::. Now the module is in the CPAN as DBSchema::Normalizer. Thanks to all the Monks for their valuable help. _ _ _ _ (_\|\| \| \|(_\|>< _\|	[reply] [d/l]
Re: Database normalization the easier way by zakzebrowski (Curate) on Dec 17, 2001 at 18:44 UTC
Darn cool if I don't say so myself...++ Update: I don't see the reason to give -- to any node that says the parent node is cool... I would personally like that kind of feedback when I write something other than no comment at all. ---- Zak	[reply]


Perl Monk, Perl Meditation
	PerlMonks