DESCRIPTION

This post describes how to normalize (to Second Normal Form) a table and presents a specific Perl script to migrate the data from the original table into the new tables. The underlying database engine used is mysql.

WHY NORMALIZE?

Because you don't want redundant data in your database do you? This could cause scalability issues down the road, because your database WILL grow in size. More importantly however, is the simple fact that redundancy leads to errors. Imagine storing the name of a course (10,000 of them) along side the student taking the course, then deciding to change the name of that course. If you had built a seperate table of courses and link the students to a 'course id' instead, then this task would be trivial. What if some courses were spelled 'CS101' and others 'CS0101'? Not fun in the former's case!

ORGINAL TABLE

The original table was a test table that i used from time to time. I populated it with info ~~gleemed~~ gleaned (thanks deprecated) from a Perl script via File::Find and MP3::Info. The problem was that I used a database table that is only First Normal Form at best:

# Diagram 1
+-----+-----------------+--------+-----------+------+
| id  | title           | artist | album     | year |
+-----+-----------------+--------+-----------+------+
| 494 | Isobel          | Bjork  | Telegram  | 1996 |
| 495 | Cover Me        | Bjork  | Telegram  | 1996 |
| 498 | Crying          | Bjork  | Debut     | 1993 |
| 499 | One Day         | Bjork  | Debut     | 1993 |
+-----+-----------------+--------+-----------+------+
[download]

First Normal Form requires that that a table have no repeating groups and no multi-valued attributes. Diagram 2 displays a design that breaks First Normal Form:

# Diagram 2
+-----+-----------------+--------+-----------+------+
| id  | titles          | artist | album     | year |
+-----+-----------------+--------+-----------+------+
| 494 | Isobel,Cover Me | Bjork  | Telegram  | 1996 |
| 495 | Crying,One Day  | Bjork  | Debut     | 1993 |
+-----+-----------------+--------+-----------+------+
[download]

Here you see that an album's song titles are lumped together in one field. This is bad because you need more than SQL to seperate the values. This is violating the 'no multi-attributes' rule of First Normal Form.

How about Second Normal Form? Diagram 1 is not in Second Normal Form, which requires that any non-key field be dependent upon the entire key. The name of the album has no dependency upon the artist, also the name of the song has no dependency upon the album. Notice that i say 'name' - i am not saying that a song is not dependent upon the album it belongs to or the artist that the album belongs to - just those extra attributes of a song and an album are not dependent. The key to understanding Second Normal Form is that you will almost always refer to a row by a unique id number that really has no meaning other than to provide uniqueness.

This is the guts of this discussion - migrating a database table from First Normal Form to Second.

ANALYSIS

The first step is to analyze your existing database table. Here is the CREATE statement i used (using mysql):

CREATE TABLE songs (
  id int(11) NOT NULL auto_increment,
  title varchar(65) default NULL,
  artist varchar(65) default NULL,
  album varchar(65) default NULL,
  year varchar(4) default NULL,
  PRIMARY KEY  (id)
);
[download]

I am skipping indexes as they are off-topic. Now, what attributes are NOT dependent upon the primary key? Title refers the song's title, it is dependent. Artist is the name of the artist, it is not dependent. Album is the name of the album, it is not dependent - same for year.

So, our new songs table - which i am going to rename to song - should look like this:

CREATE TABLE song (
  id int(11) NOT NULL auto_increment,
  title varchar(65) default NULL,
  PRIMARY KEY  (id)
);
[download]

But how do you associate a song to the album it belongs to. You need a foreign key. Before i discuss foreign keys, we should back up and approach our new design 'top-down' instead of 'bottom-up'.

So, let's instead start at the top of the heirarchy - artist:

CREATE TABLE artist (
  id int(11) NOT NULL auto_increment,
  name varchar(65) default NULL,
  PRIMARY KEY  (id)
);
[download]

This should be more digestable - all you have is a unique id (that is generated by the database) that uniquely identifies each row, and an attribute to store the name of the artist. You could go one step further and require each name to be unique, but again, that's beyond the scope of this discussion.

The ablum table looks like this:

CREATE TABLE album (
  id int(11) NOT NULL auto_increment,
  title varchar(65) default NULL,
  year varchar(4) default NULL,
  artist_id int(11) NOT NULL,
  PRIMARY KEY  (id),
  FOREIGN KEY index_artist_id(artist_id) 
    REFERENCES artist(id)
);
[download]

Same idea as artist - but this time we have a foreign key (artist_id) to associate an album row to an artist. An artist has albums, an album belongs to an artist. The foreign key artist_id is the value of the id column from the artist table that this row (this album) belongs to. The FOREIGN KEY declaration tells the database that it should check that this value exists before attempting to insert it into the album table. Since mysql still does not support enforced referential integrity, this statement is uneccesary, but other database vendors do support it, so i have including it. index_artist_id is an index that is created as a side effect, but that's the last time i will mention indexes.

Finally, the song table looks like this:

CREATE TABLE song (
  id int(11) NOT NULL auto_increment,
  title varchar(65) default NULL,
  album_id int(11) NOT NULL,
  PRIMARY KEY  (id),
  FOREIGN KEY index_album_id(album_id) 
    REFERENCES album(id)
);
[download]

Very similar to the album table. You could other fields if you like, such as track_time or track_number - these directly relate to a song, so their inclusion will not break Second Normal Form. Including something like songwriter would, however - if the entire band owns the songwriter credits, then all you need to do is include the band id number. To add a finer grain you might need to implement a band_member table or such.

Now that we have our table definitions, lets look at what the contents of each table sould look like:

mysql: select * from mp3.song limit 3;
+----+--------------------------------+----------+
| id | title                          | album_id |
+----+--------------------------------+----------+
|  1 | Until the End of the World     |        1 |
|  2 | So Cruel                       |        1 |
|  3 | Tryin to Throw Your Arms Aroun |        1 |
+----+--------------------------------+----------+

mysql: select * from mp3.album limit 3;
+----+--------------+------+-----------+
| id | title        | year | artist_id |
+----+--------------+------+-----------+
|  1 | Achtung Baby | 1991 |         1 |
|  2 | War          | 1983 |         1 |
|  3 | October      | 1981 |         1 |
+----+--------------+------+-----------+

mysql: select * from mp3.artist limit 3;
+----+-------------+
| id | name        |
+----+-------------+
|  1 | U2          |
|  2 | Bjork       |
|  3 | Pink Floyd  |
+----+-------------+
[download]

Notice how you can track the song 'So Cruel' to it's album (Achtung Baby) and (via the album table) to it's artist (U2):

mysql: select artist.name from artist 
       inner join album on artist.id=album.artist_id
       inner join song on album.id=song.album_id
       where song.title='So Cruel'; 
+------+
| name |
+------+
| U2   |
+------+
[download]

MIGRATION

Now that we see where we are going - let's get there! The idea is to get all the data from the existing First Normal Form table, munge it, then insert the data into the appropriate new Second Normal Form tables.

So, how do we munge the existing data? Refering back to Diagram 1, if we could build a data structure like this:

$VAR1 = {
   'Bjork' => {
      'Debut' => {
         'list' => [
            'Crying',
            'One Day',
            'Come To Me',
            'Violently Happy'
         ],
         'year' => '1993'
      },
      'Telegram' => {
         'list' => [
            'Isobel',
            'Cover Me',
            'Army Of Me',
            'Possibly Maybe'
         ],
         'year' => '1996'
      }
   }
   'U2' => { 
      #etc...
   }
};
[download]

then we could first grab the first key and insert that into the artist table. Then we find the unique id that was assigned to the newly inserted row and use that as the foreign key for the albums. We do the same for each album, except we save the list of songs until after the new album has been inserted. Same thing - grab the newly assigned unique id and us it for the foreign key for the songs.

Confusing? How about some code? My database name is mp3.

use strict;

use DBI;

# connect
my $dbh = DBI->connect(
   qw(DBI:mysql:mp3:host user pass),
   { RaiseError => 1 }
);

# get existing data as one big ole 2-d array
my $songs = $dbh->selectall_arrayref("
   select artist,album,title,year
   from songs
");

# munge data into new data structure
my %hash;
foreach my $row (@$songs) {
   my ($artist,$album,$title,$year) = @$row;

   # auto-vivification is what makes this all so cool
   push @{$hash{$artist}->{$album}->{'list'}},$title;

   # getting the year is tricky - since i know they
   # are all the same - i can safely or-cache it ;)
   $hash{$artist}->{$album}->{'year'} ||= $year;
}

foreach my $artist (keys %hash) {

   # insert this artist - get unique id for albums
   $dbh->prepare("
      insert into artist(name) 
      values(?)
   ")->execute($artist);
   my $artist_id = last_id();

   foreach my $album (keys %{$hash{$artist}}) {
      my $year = $hash{$artist}->{$album}->{'year'};

      # insert this album - get unique id for songs
      $dbh->prepare("
         insert into album(title,year,artist_id)
         values(?,?,?)
      ")->execute($album,$year,$artist_id);
      my $album_id = last_id();

      foreach my $song (@{$hash{$artist}->{$album}->{'list'}}) {

         # insert this song
         $dbh->prepare("
            insert into song(title,album_id)
            values(?,?)
         ")->execute($song,$album_id);
      }
   }

}

$dbh->disconnect;

# mysql specific - other vendor's milleage will vary
sub last_id {
   $dbh->selectall_arrayref("
      select LAST_INSERT_ID()
   ")->[0]->[0]
}
[download]

And that's it!

CONCLUSION

Migrating a First Normal Form database table to multiple Second Normal Form tables is never a generic task - the techniques vary from table to table. The method i chose to derive my new tables was very specific to my needs, and as a result, my code 'as is' is only useful to these specific tables.

I should also note the normalization is not the ultimate solution for every database. Normalization forces JOIN's to be used in order to retrieve the equivalent row of the orginal table, and JOIN's can be quite expensive. In the real world, you will find many databases that have been DE-normalized in order to speed up certain critical queries. To tell the truth, most databases i have seen in the real world aren't even normalized in the first place, probably because JOIN syntax is complicated and intimidating. Compare these two queries:

# orginal First Normal Form table
mysql: select artist,album,title,year 
       from songs 
    order by artist,year,album,title 
    limit 2;
+----------------+-------------+-----------------+------+
| artist         | album       | title           | year |
+----------------+-------------+-----------------+------+
| 10,000 Maniacs | In My Tribe | A Campfire Song | 1987 |
| 10,000 Maniacs | In My Tribe | Cherry Tree     | 1987 |
+----------------+-------------+-----------------+------+
2 rows in set (0.02 sec)

# new Second Normal Form tables
mysql: select artist.name, album.title as album,
              song.title, album.year
       from artist
       inner join album on artist.id=album.artist_id
       inner join song on album.id=song.album_id
       order by artist.name,album.year,album.title,song.title 
       limit 2;
+----------------+-------------+-----------------+------+
| name           | album       | title           | year |
+----------------+-------------+-----------------+------+
| 10,000 Maniacs | In My Tribe | A Campfire Song | 1987 |
| 10,000 Maniacs | In My Tribe | Cherry Tree     | 1987 |
+----------------+-------------+-----------------+------+
2 rows in set (0.04 sec)
[download]

The resulting SQL is ~~much~~(update: i should have been more careful with my words) a bit more complex, and the results are even a bit slower. Before you go off thinking that normalization is worse, see dws's excellent rebutal below. Also realize that with normalization, you can get at the pieces easier - what if you misspelled Bjork?

# 1st form
mysql: update songs set name="Bjork"
       where name="Bork";

Rows matched: 14  Changed: 14

# 2nd form
mysql: update artist set name="Bjork" 
       where name="Bork";

Rows matched: 1  Changed: 1
[download]

The first example could be 14 rows or 14,000,000 rows. The second example will always be 1.

There are many, many papers, tutorials, discussions, etc. available on the Web - here is a resource that i found particularly useful: http://www.palslib.com/Fundamentals/Database_Design.html.

Good luck!

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)

Comment on Migrating a 1NF table to multiple 2NF tables Select or Download Code

Replies are listed 'Best First'.
Re: Migrating a 1NF table to multiple 2NF tables by dws (Chancellor) on Dec 05, 2001 at 02:06 UTC
To this otherwise informative post, I take one exception: Normaliztion forces JOIN's to be used in order to retrieve the equivalent row of the orginal table, and JOIN's can be quite expensive. In the real world, you will find many databases that have been DE-normalized in order to speed up certain critical queries. To tell the truth, most databases i have seen in the real world aren't even normalized in the first place, probably because JOIN syntax is complicated and intimidating. JOINs can be expensive, but they can also improve performance. I once got a 15x increase in performance by turning a 6-way join into a 7-way join. It was originally 6-way because the author had bought into the superstition that "JOINs are bad and Are To Be Avoided". He thought he was doing good by minimizing JOINs, but he hadn't taken the time to dig further to understand query planning. Modern RDBMSs are pretty good when it comes to (internally) optimizing the order of JOINs when the fields you are joining on are indexed. And some RDBMSs will even optimize to handle cases where you're joining against a table that requires a linear scan. It pays to understand JOINs. The syntax isn't really that complicated, and the payoff you get in data integrity by using a normalized form (2NF or 3NF) to avoid duplication can be a really big win. Really big shops separate logical database design from physical design. Physical design involves things like deciding how to physically partition the database so that files involved in performance-critical joins live under separate disk heads (since moving disk heads is a relatively expensive operation). One way to measure the maturity of a shop is to look at what kind of physical design they've done. If their tables, temp space, and log space are on the same drive, they're probably not very mature (or performance isn't an issue).	[reply]
Re: Re: Migrating a 1NF table to multiple 2NF tables by uwevoelker (Pilgrim) on Dec 06, 2001 at 04:20 UTC
Hello dws, can you please tell me, where I can learn more about JOINs? Thank you very much, Uwe	[reply]
Re: !--info--Migrating a 1NF table to multiple 2NF tables by Arguile (Hermit) on Dec 05, 2001 at 15:58 UTC
To simplify your joins you may want to consider using the `WHERE a.pk = b.fk` syntax. It's often much clearer than the oft convoluted `JOIN ... ON`. `-- I took the liberty of using table aliases as well SELECT a.name, b.title, s.title, b.year FROM artist a, album b, song s WHERE a.id = b.artist_id AND b.id = s.album_id ORDER BY a.name, b.year, b.title, s.title LIMIT 2;` [download] In some databases this can lead to much faster queries as well. In these cases, by not explicitly stating the join order, the query optimiser can step in and decide based on indices and other factors. Another thing of note is that as of 3.23.44 MySQL does support foreign key relation(finally!) a bit in the InnoDB table type. If you are using MySQL as a relational database I urge, no implore, you to update and use said table types. Without such contraints (and fully functional transaction, but we'll ignore that for now) you're really not even using a relational database. As an added incentive they'll vastly improve the robustness of your application and remove a lot of (what should be) uneccassy work at the application level. (And yes I do froth at the mouth a bit when expounding on the evils of bad, or the lack of, data-integrity checks ;) To create an InnoDB table in MySQL simply append 'TYPE=INNODB' to all your CREATE TABLE statements. `CREATE TABLE parent(id INT NOT NULL, PRIMARY KEY (id)) TYPE=INNODB;` [download] See here for more details on the syntax and restrictions.	[reply] [d/l] [select]
Re: Re: !--info--Migrating a 1NF table to multiple 2NF tables by extremely (Priest) on Dec 07, 2001 at 09:51 UTC
While table aliases are good for people's sanity, I have to object to your suggestion not to use JOIN. Lazy or Implied JOINs are a bad habit. If someone is comfortable with it you shouldn't chide them. Not only is good to know the proper way once you get deep enough to want RIGHT OUTER JOINs and such but the syntax will rescue you from the dreaded "dot product bomb". See, without a join constraint, SQLs only choice is to join the first row of one table to every row in the next, and then the second row etc etc. A three table join of 10,000 records in each table, each a tiny 100 bytes will return at least 100,000,000,000,000 bytes of data plus framing and protocol and such. People rarely forget the ON xxx=yyy clause off the end of the JOIN but often when tinkering or generating code you wind up with an incomplete WHERE. I've personally seen a perl script kill a $200,000 SGI box with 1.5GB of total memory. Just shut it down, ate every drop of memory and was still trying when we finally got a root terminal to except "killall perl<RET>" at the blazing speed of 1 char every 5 seconds. =) When speed of data return matters, and the optimizer fails you, at least you'll be able to reorder the JOINs without refactoring the whole deal. And about 90% of the time, the natural order you follow will be the right thing. As to the MySQL notes on keys and the need for constraints I say a hearty Amen Brother! -- $you = new YOU; honk() if $you->love(perl)	[reply]
Re: !--info--Migrating a 1NF table to multiple 2NF tables by Rockafeller (Scribe) on Dec 06, 2001 at 22:20 UTC
I have found that database normalization is a hard thing to learn/teach, but it is very important to know when creating databases. Here is a power point normalization.zip from a college database design class I took that is probably the best tutorial I have seen. --Tom	[reply]
Re: Migrating a 1NF table to multiple 2NF tables by rob_au (Abbot) on Jun 07, 2004 at 10:04 UTC
There is a new article on mysql.com which also provides readers with an introduction to database normalization here. `perl -le "print unpack'N', pack'B32', '00000000000000000000001011100010'"`	[reply]


Think about Loose Coupling
	PerlMonks

Migrating a 1NF table to multiple 2NF tables