database design issue: one big table vs hundreds of smaller tables

by AltBlue (Chaplain)
on Nov 26, 2001 at 06:50 UTC

This is quite an off-topic subject, as it is in fact just a database design question, but I'm sure there are some DBMS specialists among us who could and would illuminate me :)

I'm working on a 'virtual catalog' project that requires building some kind of 'directory tree' inside a DB. It stores 'categories' and 'items', and each 'item' has a number of attributes that differ from category to category.
Here is an illustration that should explain it all:

base/
   persons/
      kids/
        jim        #   name, surname, birth, mom_name, pap_name, 
        tim        #   sister_name, brother_name, ....
        john       #
      grown_ups/   
        bob        #   name, surname, birth, salary, company,
        rob        #   phone_no, fax_no, email, homepage, ...
   books/
      manuals/
        linux        # pages_no, authors, edition, cover_photo,
        math         # editure, pub_year, ....
        geography    #
      dictionaries/
        greek        # terms_no, editure, editure, pub_year ...
        latin        #
        arabic       #
      sf/
        dune         # pages_no, volumes, editure, author, pub_year
        foundation   # prizes, rating, ...
        ubik         #
   computers/
      workstations/ 
        john's       # owner, os, monitor, video, ram, ip, location,
        mine         # hdd, cpu, keyb, mouse, ...
        his          #
      servers/       
        alpha        # location, ip_list, os, ram, cpu, hdd, services
        beta         # ...
        gamma        #

OK, so I have all these different 'things' that should live together in the same eclectic catalog. I want to be able to do complex but still fast searches all around this DB. I'd like scalability somewhere near thousands of categories, with thousands of unique items in each category.

how to organize it? As I see it at this moment there are two main solutions:

  • 2 big tables (1 categories, 1 items)
  • many tables (1 table with categories and a separate table for each category)

What would be the best database design for such a thing, keeping in mind that searching through this db should be as fast as possible? Also remember that attributes may differ a lot in number (and size).

I think the first solution is crappy and I wouldn't like to go on with it. The second is the one I have already implemented, and the results are pretty good so far, but I haven't yet had time to build some real stress test scenarios, so I don't know what will happen when I reach 1K categories with 1K items each =)

Heh, I haven't mentioned the DBMS because I wouldn't like to get solutions optimised for one particular xSQL or ySQL ;-) ... still, of course, notes regarding implementation flaws in e.g. MySQL or PostgreSQL or whatever are welcome.

--
AltBlue.


Replies are listed 'Best First'.
(Ovid) Re: database design issue: one big table vs hundreds of smaller tables
by Ovid (Cardinal) on Nov 26, 2001 at 07:41 UTC

    I don't see that two tables are going to be a truly viable solution. Your items have sufficiently unique and non-overlapping attributes that trying to stuff them all into one huge table is going to wind up with many items having a lot of blank fields. You're just begging for trouble with that. To distinguish which items have which attributes, you'd either have to create an "itemAttribute" table describing what items in what categories have what attributes (and that's going to complicate your SQL), or you'll need to move that information into your code, which means more frequent code updates as you synchronize the code and the database.
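    For concreteness, a minimal sketch of such an "itemAttribute" table (the names here are illustrative assumptions, not a recommendation to go this route):

        -- Which attributes apply to items in a given category
        -- (one row per category/attribute pair; names are illustrative).
        CREATE TABLE itemAttribute (
            categoryID    INT NOT NULL,
            attributeName VARCHAR(32) NOT NULL,
            PRIMARY KEY (categoryID, attributeName)
        );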

    I'd probably start by building a table similar to the following:

    nodeID  parentNodeID  Description
    ------  ------------  -----------
    1       1             persons
    2       1             kids
    3       1             grown_ups
    4       4             books
    5       4             manuals
    6       4             dictionaries

    You can make the nodeID the primary key and have a foreign key constraint from the parentNodeID to the nodeID. To find all root nodes:

    SELECT nodeID
    FROM   categories   -- assuming the table above is named 'categories'
    WHERE  nodeID = parentNodeID

    That's fairly easy to drill down in and with proper indexing, should be simple to search. It gets a little more complicated with items because you can't just stick a nodeID in items and then select all items with the appropriate nodeID. Perhaps you could create a meta-table listing the table names of the items you need and create lookup tables with that (mind you, this is all off the top of my head).
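    As a hedged illustration of the drill-down (again assuming the table is named 'categories'), listing the direct children of a node is a single indexed query:

        -- Direct children of node 1 ('persons'), excluding the
        -- self-referencing root row itself.
        SELECT nodeID, Description
        FROM   categories
        WHERE  parentNodeID = 1
          AND  nodeID <> parentNodeID;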

    Obviously, you want this to be as fast as possible, but I would strongly caution you against designing this system with speed in mind. You want your database schema to be correct. If the schema is poor, you're much more likely to suffer corruption from anomalies. If that starts to happen, you either have to redesign your database correctly, or put in a bunch of extra code to prevent the anomalies. The latter option, of course, is going to have a performance impact, so you'd wind up losing what you might think you gain by trying to optimize up front.

    Prematurely optimizing is a common problem. Anyone with experience with databases knows the value of a properly normalized schema. Since your application has not been written yet, you can't judge whether performance is going to be a problem. Why trade the known benefits of normalization for the unknown benefits of optimization when you don't even know whether performance will be an issue?

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      Thank you all for your replies, they offered me some really good points to start digging on. I haven't posted my actual database structure in my first note on this topic so here it is:

      -- 'attrs' keeps the attributes list in CSV form
      CREATE TABLE categories (
          id     INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
          name   CHAR(255) NOT NULL UNIQUE DEFAULT '',
          parent INT UNSIGNED NOT NULL DEFAULT 0,
          attrs  CHAR(255) NOT NULL DEFAULT '',
          time   TIMESTAMP,
          PRIMARY KEY(id),
          KEY(name(16)),
          KEY(parent)
      );

      -- _[id] is the id of the category as it exists in the 'categories' table
      CREATE TABLE cat_[id] (
          id   INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
          name CHAR(255) NOT NULL UNIQUE DEFAULT '',
          time TIMESTAMP,
          ... attribute
          ... attribute
          ... attribute
          PRIMARY KEY(id),
          KEY(name(16))
      );

      -- bloat table to speed up common search patterns
      CREATE TABLE items_all (
          id       INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
          name     CHAR(255) NOT NULL DEFAULT '',
          category CHAR(255) NOT NULL DEFAULT '',
          cat_id   INT UNSIGNED NOT NULL,
          PRIMARY KEY(id),
          KEY(name(16)),
          KEY(category(16)),
          KEY(cat_id)
      );
      As you see, it's quite a clean/lame implementation of the second variant, as stated in my previous post. footpad suggested a 'fields/attributes' table instead of my 'attrs' column in the categories table. That should be an improvement for a large number of attributes, but I expect no category will have more than, let's say, 20 attributes.
      Normalization is a good thing, and not that I wasn't thinking about it, but the problem at first (and second :) glance is that there can't be much normalization if I want to keep the bazaar feature ;-) meaning that this system should be able to store people as well as books, tools, flavours, images or anything else, regardless of any resemblance between categories. In fact, the only relation I'm thinking to enhance somehow is the parent-child category one: as many of you have noted, there could be many overlapping attributes in some cases (humans > grown-ups > artists), so maybe some normalization should be considered there. Heh, some OOP techniques like inheritance would suit these cases better. Even more, polymorphism would be needed if, for example, I'd like to move items from one category to another (e.g. 'musicians' from 'humans/grownups/artists/' into 'music/') :)
      Heh, it could get so complex that I think KISS should kick in. For speed-up purposes, additional redundant tables specialized for common search patterns may be used (like my items_all table).
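      For example, that redundancy is what makes a cross-category search one indexed query instead of one query per cat_[id] table:

          -- one query against the redundant items_all table,
          -- instead of one per cat_[id] table
          SELECT name, category, cat_id
          FROM   items_all
          WHERE  name LIKE 'dune%';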
      I have already used Swish and maybe I will use it for this project too. Cascade I've never heard of; I'll give it a try.
      tnx.

      --
      AltBlue.

        It doesn't feel like you need a tree structure at all. Maybe I am missing something...

        If you really do need a tree then Joe Celko probably has what you need. (skip the travelogue at the start)

        table: Categories
            category_desc | catID

        table: Items
            item_desc | itemID | catID

        table: Attributes
            attribute_desc | AttributID | itemID

        SELECT category_desc, item_desc, attribute_desc
        FROM   Categories, Items, Attributes
        WHERE  Categories.catID = Items.catID
          AND  Items.itemID     = Attributes.itemID;
        This scheme has many advantages over the "multiple tables" solution. A couple are:
        1. You won't have to add a table every time you add a category.

        2. With cascading deletes, you don't have to delete tables when you delete a category.

        This scheme has many advantages over the "one big table" solution. Three are:
        1. You won't have to add/delete columns if you add/delete attributes
        2. In a normalized table structure you have a lot more flexibility with the queries.

        3. If you change a description, you will do it in only one place.

        This scheme has no disadvantages compared to the two you have been considering. If performance is an issue (and it is unlikely that it is), you are better off exploring caching of one sort or another.
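        As a sketch of searching under this layout (assuming an extra attribute_value column, which the outline above doesn't show but a real Attributes table would likely need):

            -- Items in one category having a given attribute.
            -- 'attribute_value' is an assumed extra column.
            SELECT i.item_desc, a.attribute_value
            FROM   Categories c, Items i, Attributes a
            WHERE  c.catID  = i.catID
              AND  i.itemID = a.itemID
              AND  c.category_desc  = 'servers'
              AND  a.attribute_desc = 'os';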

        Apart from this being off topic, you might do better asking database questions on a database forum. The monks are a nice bunch, but not every problem is a nail.



        email: mandog

Re: database design issue: one big table vs hundreds of smaller tables
by footpad (Abbot) on Nov 26, 2001 at 08:57 UTC

    Are you perhaps looking for a data dictionary sort of application, one that tracks the locations, names, and field structures of other tables?

    To that end, perhaps the following table structures would help:

    DIRPATHS
        PathID    - INT, NOT NULL, AUTOINC
        PathName  - VARCHAR(16)   # Adjust to taste
        ParentDir - INT           # PathID of the parent directory, null for /

    TABLELIST
        TableID   - INT, NOT NULL, AUTOINC
        TableName - VARCHAR(16)   # Adjust to taste
        PathID    - INT           # Linking field pointing to DIRPATHS
        # Add'l fields, e.g. Description, LastUpdate, BackedUpOn, etc.

    FIELDNAMES
        TableID   - INT, NOT NULL # Link to match in TABLELIST
        FieldNo   - INT, NOT NULL # Field position
        Name      - VARCHAR(32)   # Adjust to taste
        # Description and other fields as needed

    ENTITIES
        LocalID    - INT, NOT NULL, AUTOINC
        EntityName - VARCHAR(32)  # Adjust to need/taste
        EntityType - ENUMERATED SET (Note: I prefer lookup tables):
                     0 - directory
                     1 - table
                     2 - field
                     3 - etc.
        EntityID   - INT          # ID from the appropriate table listed earlier

    In this scenario, you let the user search ENTITIES for the stuff they're looking for, use the EntityType field to find out which kind of object matched, and then take the EntityID back to the appropriate table. You'll need to customize backtracking code to stream a matching ID back to its original data and then handle that; however, it can be a very flexible scheme.
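    A hedged sketch of that first search step (names as in the outline above; the LIKE pattern is illustrative):

        -- Find candidate matches, then dispatch on EntityType
        -- (0 = directory, 1 = table, 2 = field, ...) to fetch the
        -- full row from DIRPATHS, TABLELIST, or FIELDNAMES.
        SELECT EntityName, EntityType, EntityID
        FROM   ENTITIES
        WHERE  EntityName LIKE '%linux%';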

    The main problem is that you essentially force yourself into a maintenance problem should something change, such as the directory structure or the storage scheme. Typically, this is a BAD THING,® as anything that gets hard-coded can make life hellish for your maintenance programmer. I avoid this whenever possible.

    Perhaps a better approach is to create one table mapping tables (and paths) to arbitrary IDs and then a second table listing all fields of every table mapped in the first. The fields are larger, but it's easier to maintain in the long run. If a table gets deleted, you only need to DELETE two sets of data. And so on.
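    For instance, a minimal sketch of that two-table mapping (names are illustrative assumptions):

        CREATE TABLE table_map (
            tableID   INT NOT NULL PRIMARY KEY,
            tablePath VARCHAR(64) NOT NULL    -- e.g. 'base/books/manuals'
        );

        CREATE TABLE field_map (
            tableID   INT NOT NULL,           -- matches table_map.tableID
            fieldName VARCHAR(32) NOT NULL,
            PRIMARY KEY (tableID, fieldName)
        );

        -- Deleting a mapped table is then exactly two DELETEs:
        DELETE FROM field_map WHERE tableID = 42;
        DELETE FROM table_map WHERE tableID = 42;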

    In either case, you'll need multiple queries to get the user to the final entity that matches. However, since you're dealing with subsets after the first one, it shouldn't be that much slower than other approaches.

    In the end, it doesn't seem that much different than creating a good search engine [1] for a web site. Instead of words in a document, you're trying to capture--and then search--the location of various types of data. The specifics change, but perhaps not the overall techniques. If we can record it, we can search it. [2]

    As for interesting ways to do this in Perl, I find that someone has written an interesting column or few. (Note: they may not seem topical at first glance, but I think the techniques may help.)

    --f

    1 - I do recommend getting the latest examples, though. Things have changed since this was published.

    2 - Or, as some guy named Dutch once said, "If it bleeds, we can kill it."

Re: database design issue: one big table vs hundreds of smaller tables
by ask (Pilgrim) on Nov 26, 2001 at 07:04 UTC
    I would probably go with one table for the categories and a table for each distinct type of item group. (It looks like the items are too different to go into one big table).

    For fast searching, look at the latest Swish-E. You can do a "streaming" database dump formatted as simple XML and have swish-e read and index that.

    In my experience you end up doing table scans if you need any kind of advanced searching. If you insist on doing it with SQL, then be sure to use as few tables as possible (or group them as much as possible based on how they are going to be used). You don't want to run a thousand queries to search all categories.

     - ask

    -- 
    ask bjoern hansen, http://ask.netcetera.dk/   !try; do();
    
Re: database design issue: one big table vs hundreds of smaller tables
by Ryszard (Priest) on Nov 26, 2001 at 11:38 UTC
    I'd read up on some normalisation techniques. Check out something called 3rd normal form or, if you're feeling tricky, BCNF (Boyce-Codd normal form).

    In short, group things by concepts. For example:
    create table books (
        cat_id   number,
        pages_no number,
        author   varchar
    );

    create table categories (
        cat_id   <insert fav db sequence ref here>,
        cat_desc varchar
    );

    insert into categories values (1, 'computers-general');
    insert into categories values (2, 'computers-programming');
    Each concept table (ala books) would have all the attributes of the concept. (for this example we're looking at pages_no, author). You could then create a foreign key constraint to categories via cat_id.

    With a foreign key constraint you cannot have any illegal categories in your concept (books) table, meaning every cat_id has to exist in the categories table. The advantage is that it keeps the integrity of your data; the disadvantage is that it can be a pain sometimes, but it is well worth the effort IMHO.
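    A minimal sketch of that constraint (generic SQL; the exact syntax varies by RDBMS):

        create table books (
            cat_id   integer not null references categories (cat_id),
            pages_no integer,
            author   varchar(255)
        );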

    If you use your RDBMS's version of a sequence (and create any _id's as a primary key) you won't have any trouble with duplicate _id's, as a primary key is by definition unique and a sequence is an ever-incrementing value.
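    For example, Postgres-style (MySQL would use AUTO_INCREMENT instead):

        create sequence cat_id_seq;

        create table categories (
            cat_id   integer primary key default nextval('cat_id_seq'),
            cat_desc varchar(255)
        );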

    As mentioned above, long wide tables are not very performant and take up lots of unnecessary space, as they tend to be sparse (i.e. not every field in the tuple is populated). If you go for narrow, dense tables (i.e. every field in the tuple is populated) you'll get better optimisation from your database.

    Make sure you think about indexes as well. Indexes should be on primary keys at a minimum, and should be on a different physical device to your tables (if you have the hardware). In fact, if you're going hardcore, think about different disks for the OS, the database software, the tables, and the indexes.

    If you don't have too much data and the transaction rate is not high, one disk is prolly enuff for everything.. ;-)

    Disclaimer
    The table structure assumes you've got the same list of 'attributes' for each concept, i.e. every book has pages and an author, regardless of whether it's a programming book or a book on gardening.

    Alternatively, if you've got lots of concepts and shared attributes (i.e. the server/workstation type of thing), perhaps you should look at something like LDAP or a multidimensional database that caters for sparsity in your data.
Re: database design issue: one big table vs hundreds of smaller tables
by Zaxo (Archbishop) on Nov 26, 2001 at 08:09 UTC

    You should look to normalization for each of your categories. For instance, the kids/grown_ups split duplicates information in the birth date field of each. Kids are constantly filtering into the grown-up side of the ledger. Unifying that classification to just Persons will let you select by whatever age range is desired, without maintenance of redundant categories.
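    A sketch of that selection against a unified persons table (table and column names are assumptions; date arithmetic syntax varies by RDBMS):

        -- 'kids' becomes a query instead of a category to maintain
        SELECT name, surname
        FROM   persons
        WHERE  birth > CURRENT_DATE - INTERVAL '18' YEAR;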

    After Compline,
    Zaxo

Re: database design issue: (solved with Cascade)
by markjugg (Curate) on Nov 27, 2001 at 02:10 UTC
    AltBlue,

    I solved the same problem a couple of years ago with Cascade. It uses Perl and Postgres (MySQL is supported in some versions) and is available under the GPL.

    Ovid's solution is the most intuitive one for handling the "tree" structure of the categories, but it won't provide the best performance of the available methods (as he hints at). It also makes some SQL queries hard. I solved this by adding a "sort_key" in addition to the parent_id, which gives you your exact location in the tree, allowing you to answer almost all questions about relationships in the tree in one query. You can peek at my data model and code to see how that works.
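    The idea behind such a sort_key is essentially a materialized path: each row stores its full position, so subtree questions become prefix matches. A hedged sketch (these column names and key format are assumptions, not necessarily Cascade's actual schema):

        -- categories:  id | parent_id | sort_key    | name
        --              1  | 1         | 001         | base
        --              2  | 1         | 001.002     | persons
        --              3  | 2         | 001.002.003 | kids
        -- All descendants of 'persons', in tree order, in one query:
        SELECT name
        FROM   categories
        WHERE  sort_key LIKE '001.002.%'
        ORDER  BY sort_key;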

    But don't go solving the problem again, just use Cascade. :) The demo site is updated to use the 2.1.0 BETA version at the moment, which isn't quite released, but adds several new features. -mark

Re: database design issue: one big table vs hundreds of smaller tables
by jepri (Parson) on Nov 27, 2001 at 06:41 UTC
    You could try a funky rules-based language like Prolog, but I don't know how it would stand up to a lot of data.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.
