This is quite an off-topic subject, as it's really just a database design
question, but I'm sure there are some DBMS specialists among us who
could and would enlighten me :)
I'm working on a 'virtual catalog' project that requires building some kind of
'directory tree' inside a DB. It stores 'categories' and 'items'. Each 'item'
has a number of attributes, different for each category.
Here is an illustration that should explain it all:
base/
  persons/
    kids/
      jim        # name, surname, birth, mom_name, pap_name,
      tim        #   sister_name, brother_name, ...
      john       #
    grown_ups/
      bob        # name, surname, birth, salary, company,
      rob        #   phone_no, fax_no, email, homepage, ...
  books/
    manuals/
      linux      # pages_no, authors, edition, cover_photo,
      math       #   editure, pub_year, ...
      geography  #
    dictionaries/
      greek      # terms_no, editure, pub_year ...
      latin      #
      arabic     #
    sf/
      dune       # pages_no, volumes, editure, author, pub_year
      foundation #   prizes, rating, ...
      ubik       #
  computers/
    workstations/
      john's     # owner, os, monitor, video, ram, ip, location,
      mine       #   hdd, cpu, keyb, mouse, ...
      his        #
    servers/
      alpha      # location, ip_list, os, ram, cpu, hdd, services
      beta       # ...
      gamma      #
OK, so I have all these different 'things' that should live together
in the same eclectic catalog. I want to be able to do complex but still fast
searching all around this DB. I'd like the scalability range to be somewhere
near thousands of categories, with thousands of unique items in each category.
How should I organize it? As I see it at the moment there are two main solutions:
- 2 big tables (1 for categories, 1 for items)
- many tables (1 table with categories, plus a separate table for each category)
What would be the best database design for such a thing, keeping in mind that searching through this DB should be as fast as possible? Also remember that attributes may differ a lot in number (and size).
The first solution I think is crappy and I wouldn't like to go on with it. The second is the one I have already implemented, and the results are pretty good so far, but I haven't yet had the time to build some real stress-test scenarios, so I don't know what could happen when I reach 1K categories with 1K items in each =)
Heh, I haven't mentioned the DBMS because I wouldn't like to get particular solutions optimised for xSQL or ySQL ;-) ... yet, of course, notes regarding implementation flaws in e.g. MySQL or PostgreSQL or whatsoever are welcome.
--
AltBlue.
(Ovid) Re: database design issue: one big table vs hundred of smaller tables
by Ovid (Cardinal) on Nov 26, 2001 at 07:41 UTC
I don't see that two tables are going to be a truly viable solution. Your items have sufficiently unique and non-overlapping attributes that trying to stuff them all into one huge table will leave many items with a lot of blank fields. You're just begging for trouble with that. To distinguish which items have which attributes, you'd either have to create an "itemAttribute" table describing which items in which categories have which attributes (and that's going to complicate your SQL), or you'd need to move that information into your code, which would then require more frequent code updates as you're forced to keep the code and the database synchronized.
I'd probably start by building a table similar to the following:
nodeID  parentNodeID  Description
------  ------------  -----------
1       1             persons
2       1             kids
3       1             grown_ups
4       4             books
5       4             manuals
6       4             dictionaries
You can make the nodeID the primary key and have a foreign key constraint from the parentNodeID to the nodeID. To find all root nodes (the FROM table name is whatever you call this one, say nodes):
SELECT nodeID FROM nodes WHERE nodeID = parentNodeID;
That's fairly easy to drill down in and with proper indexing, should be simple to search. It gets a little more complicated with items because you can't just stick a nodeID in items and then select all items with the appropriate nodeID. Perhaps you could create a meta-table listing the table names of the items you need and create lookup tables with that (mind you, this is all off the top of my head).
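To make the drill-down concrete, here is a minimal sketch of the adjacency-list queries implied above. The table name `nodes` is an assumption (the post doesn't name one); columns follow the example table:

```sql
-- children of a given category (drill down one level);
-- the root convention here is nodeID = parentNodeID, so exclude it:
SELECT c.nodeID, c.Description
FROM   nodes AS c
WHERE  c.parentNodeID = 4          -- e.g. everything directly under 'books'
  AND  c.nodeID <> c.parentNodeID;

-- walking *up* to find a node's full path has to go one level at a
-- time (SQL of that era had no recursive queries), typically in a
-- loop in the application code:
SELECT parentNodeID, Description FROM nodes WHERE nodeID = 5;
```

Each level of depth costs one query this way, which is fine for drilling down but gets expensive for whole-subtree operations.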
Obviously, you want this to be as fast as possible, but I would strongly caution you against designing this system with speed in mind. You want your database schema to be correct. If the schema is poor, you're much more likely to suffer corruption from anomalies. If that starts to happen, you either have to redesign your database correctly, or put in a bunch of extra code to prevent the anomalies. The latter option, of course, is going to have a performance impact, so you'd wind up losing what you might think you gain by trying to optimize up front.
Premature optimization is a common problem. Anyone with any experience with databases knows the value of a properly normalized schema. Since your application has not been written yet, you can't judge whether or not performance is going to be a problem. Why trade the known benefits of normalization for the unknown benefits of optimization when you don't even know whether performance will be an issue?
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
-- 'attrs' keeps the attributes list in CSV form
CREATE TABLE categories (
id INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
name CHAR(255) NOT NULL UNIQUE DEFAULT '',
parent INT UNSIGNED NOT NULL DEFAULT 0,
attrs CHAR(255) NOT NULL DEFAULT '',
time TIMESTAMP,
PRIMARY KEY(id),
KEY(name(16)),
KEY(parent)
);
-- _[id] is the id of the category as existent in 'categories' table
CREATE TABLE cat_[id] (
id INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
name CHAR(255) NOT NULL UNIQUE DEFAULT '',
time TIMESTAMP,
-- ... one column per attribute, varying by category ...
PRIMARY KEY(id),
KEY(name(16))
);
-- bloat table to speed up common search patterns
CREATE TABLE items_all (
id INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
name CHAR(255) NOT NULL DEFAULT '',
category CHAR(255) NOT NULL DEFAULT '',
cat_id INT UNSIGNED NOT NULL,
PRIMARY KEY(id),
KEY(name(16)),
KEY(category(16)),
KEY(cat_id)
);
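A sketch of how a search would run against this layout, using the tables above (`cat_42` is a made-up instance of the `cat_[id]` pattern):

```sql
-- global search by item name via the redundant items_all table:
SELECT id, name, category, cat_id
FROM   items_all
WHERE  name LIKE 'dune%';

-- then fetch the full attribute row from the matching per-category
-- table; the table name cat_<cat_id> has to be assembled in
-- application code, since SQL can't parameterize table names:
SELECT * FROM cat_42 WHERE name = 'dune';
```

The second step is the structural cost of the many-tables design: every cross-category query needs that string-built table name, or one query per category.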
As you see, it's quite a clean/lame implementation of the second variant,
as stated in my previous post. footpad suggested a 'fields/attributes'
table instead of my 'attrs' column in the categories table. That should be
an improvement for a large number of attributes, but I'm thinking no
category will have more than, let's say, 20 attributes.
Normalization is a good thing, and not that I wasn't thinking about it,
but the problem at first (and second :) glance is that there can't be
full normalization if I want to keep the bazaar feature ;-) meaning that
this system should be able to store people as well as books, tools,
flavours, images or anything else, disregarding any resemblance between
categories.
In fact the only relation I'm thinking of enhancing somehow is the
parent-child category one: as many of you have noted, there could be many overlapping attributes in some cases (humans -> grown-ups -> artists), so maybe some normalization should be considered there. Heh, some OOP techniques such as inheritance would fit better in these cases. Even more, polymorphism would be needed if, for example, I would like to move items from one category to another (e.g. 'musicians' from 'humans/grownups/artists/' into 'music/') :)
Heh, it's getting so complex that I think KISS should come up. For speed-up purposes there may be additional redundant tables specialized in various things (like my items_all table).
I have already used Swish and maybe I will use it for this project too. Cascade I've never heard of; I'll give it a try.
tnx.
--
AltBlue.
table: Categories:
category_desc | catID
table: Items:
item_desc | itemID | catID
table: Attributes
attribute_desc | AttributID | itemID
SELECT category_desc, item_desc, attribute_desc
FROM Categories, Items, Attributes
WHERE Categories.catID = Items.catID AND
Items.itemID = Attributes.itemID;
This scheme has many advantages over the "multiple tables" solution. A couple are:
- You won't have to add a table every time you add a category.
- With cascading deletes, you won't have to drop tables when you delete a category.
This scheme also has many advantages over the "one big table" solution. Three are:
- You won't have to add/delete columns when you add/delete attributes.
- In a normalized table structure you have a lot more flexibility with the queries.
- If you change a description, you change it in only one place.
This scheme has no disadvantages compared to the two you have been considering. If performance is an issue (and it is unlikely that it is), you are better off exploring caching of one sort or another.
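A sketch of the kind of query this flexibility buys, using the tables above (this assumes attribute values are stored inside attribute_desc, as the three-table sketch implies, and the matched string is made up):

```sql
-- find every item, in any category, carrying a given attribute;
-- with one-table-per-category this can't be expressed generically:
SELECT i.item_desc, c.category_desc
FROM   Items i, Categories c, Attributes a
WHERE  c.catID = i.catID
  AND  a.itemID = i.itemID
  AND  a.attribute_desc = 'pub_year: 1965';
```

One query covers the whole catalog, whatever mix of categories it holds.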
Apart from being off topic, you might do better by asking database questions on
a database forum. The monks are a nice bunch, but not every problem is a nail.
email: mandog
Re: database design issue: one big table vs hundred of smaller tables
by footpad (Abbot) on Nov 26, 2001 at 08:57 UTC
Are you perhaps looking for a data dictionary sort of application, one that tracks the locations, names, and field structures of other tables?
To that end, perhaps the following table structures would help:
DIRPATHS
PathID - INT, NOT NULL, AUTOINC
PathName - VARCHAR(16) # Adjust to taste
ParentDir - INT # PathID of the parent directory, null for /
TABLELIST
TableID - INT, NOT NULL, AUTOINC
TableName - VARCHAR(16) # Adjust to taste
PathID - INT # Linking Field pointing to DIRPATHS
# Add'l fields, e.g. Description, LastUpdate, BackedUpOn, etc.
FIELDNAMES
TableID - INT, NOT NULL # Link to match in TABLELIST
FieldNo - INT, NOT NULL, # Field Position
Name - VARCHAR(32) # Adjust to taste.
# Description and other fields as needed
ENTITIES
LocalID - INT, NOT NULL, AUTOINC
EntityName - VARCHAR(32) # Adjust to need/taste
EntityType - ENUMERATED SET (Note: I prefer lookup tables):
0 - directory
1 - table
2 - field
3 - Etc
EntityID - INT # ID from the appropriate table listed earlier
In this scenario, you let the user search ENTITIES for the stuff they're looking for, use the EntityType field to find out what kind of object matched, and then take the EntityID back to the appropriate table. You'll need custom backtracking code to map a matching ID back to its original data and then handle that; however, it can be a very flexible scheme.
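A minimal sketch of that lookup, using the table and field names above (the EntityID value is made up; the dispatch on EntityType has to happen in application code, since the target table differs per row):

```sql
-- step 1: find matching entities by name:
SELECT EntityType, EntityID
FROM   ENTITIES
WHERE  EntityName LIKE '%linux%';

-- step 2: per matched row, query the table its EntityType points at,
-- e.g. for EntityType = 1 (table):
SELECT * FROM TABLELIST WHERE TableID = 7;
```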
The main problem is that you essentially force yourself into a maintenance problem should something change, such as the directory structure or the storage scheme. Typically, this is a BAD THING,® as anything that gets hard-coded can make life hellish for your maintenance programmer. I avoid this whenever possible.
Perhaps a better approach is to create one table mapping tables (and paths) to arbitrary IDs, and then a second table listing all fields of every table mapped in the first one. The fields are larger, but it's easier to maintain in the long run. If a table gets deleted, you only need to DELETE two sets of data. And so on.
In either case, you'll need multiple queries to get the user to the final entity that matches. However, since you're dealing with subsets after the first one, it shouldn't be that much slower than other approaches.
In the end, it doesn't seem that much different than creating a good search engine1 for a web site. Instead of words in a document, you're trying to capture--and then search--the location of various types of data. The specifics change, but perhaps not the overall techniques. If we can record it, we can search it.2
As far as interesting ways to do this in Perl, I find that someone has written an interesting column or few. (Note: They may not seem topical at first glance, but I think the techniques may help.)
--f
1 - I do recommend getting the latest examples, though. Things have changed since this was published.
2 - Or, as some guy named Dutch once said, "If it bleeds, we can kill it."
Re: database design issue: one big table vs hundred of smaller tables
by ask (Pilgrim) on Nov 26, 2001 at 07:04 UTC
I would probably go with one table for the categories and a table for each distinct type of item group. (It looks like the items are too different to go into one big table).
For fast searching, look at the latest Swish-E. You can do a "streaming" database dump formatted as simple XML and have swish-e read and index that.
In my experience you end up doing table scans if you need any kind of advanced searching. If you insist on doing it with SQL, then be sure to use as few tables as possible (or group them as much as possible based on how they are going to be used). You don't want to run a thousand queries to search all categories.
- ask
--
ask bjoern hansen, http://ask.netcetera.dk/ !try; do();
Re: database design issue: one big table vs hundred of smaller tables
by Ryszard (Priest) on Nov 26, 2001 at 11:38 UTC
I'd read up on some normalisation techniques. Check out something called 3rd normal form, or if you're feeling tricky, BCNF (Boyce-Codd normal form).
In short, group things by concepts. For example:
create table books (
cat_id number,
pages_no number,
author varchar
);
create table categories (
cat_id <insert fav db sequence ref here>,
cat_desc varchar
);
insert into categories values (1, 'computers-general');
insert into categories values (2, 'computers-programming');
Each concept table (ala books) would have all the attributes of the concept. (for this example we're looking at pages_no, author). You could then create a foreign key constraint to categories via cat_id.
Using a foreign key constraint you cannot have any illegal (meaning any cat_id has to be in the categories table) categories in your concept (books) table. The advantage of this is it keeps the integrity of your data, the disadvantage is it can be a pain sometimes, but well worth the effort IMHO.
If you use your RDBMS's version of a sequence (and create any _id's as a primary key) you won't have any trouble with duplicate _id's, as a primary key is by definition unique, and a sequence is an ever-incrementing value.
As mentioned above, long wide tables are not very performant, and take up lots of unnecessary space as they tend to be sparse (i.e. not every field in the tuple is populated). If you go for narrow, dense tables (i.e. every field in the tuple is populated) you'll get better optimisation from your database.
Make sure you think about indexes as well. Indexes should be on primary keys at a minimum, and should be on a different physical device to your tables (if you have the hardware). In fact, if you're going hardcore, think about different disks for the OS, the database software, the tables, and the indexes.
If you don't have too much data and the transaction rate is not high, one disk is prolly enuff for everything.. ;-)
Disclaimer
The table structure assumes you've got the same list of 'attributes' for each concept, i.e. every book has pages and an author, regardless of whether it's a programming book or a book on gardening.
Alternatively, if you've got lots of concepts and shared attributes (i.e. the server/workstation type of thing), perhaps you should look at something like LDAP or a multidimensional database that caters for sparsity in your data.
Re: database design issue: one big table vs hundred of smaller tables
by Zaxo (Archbishop) on Nov 26, 2001 at 08:09 UTC
You should look to normalization for each of your categories. For instance, the kids/grown_ups split duplicates information already present in the birth date field of each. Kids are constantly filtering into the grown-up side of the ledger. Unifying that classification into just Persons will let you select by whatever age range is desired, without maintenance of redundant categories.
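The point can be sketched like this (table and column names are made up for illustration; the cutoff date stands in for whatever "kid" means at query time):

```sql
-- one unified persons table instead of kids/grown_ups categories;
-- "kids" is then just a date filter, and nobody has to be moved
-- between categories as they age:
SELECT name, surname
FROM   persons
WHERE  birth > '1986-01-01';
```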
After Compline, Zaxo
Re: database design issue: (solved with Cascade)
by markjugg (Curate) on Nov 27, 2001 at 02:10 UTC
AltBlue,
I solved the same problem a couple of years ago with Cascade. It uses Perl and Postgres (MySQL is supported in some versions) and is available under the GPL.
Ovid's solution is the most intuitive one for handling the "tree" structure of the categories, but it won't perform as well as some other methods (as he hints at). It also makes some SQL queries hard. I solved this by adding a "sort_key" in addition to the parent_id, which gives you your exact location in the tree, allowing you to answer almost all questions about relationships in the tree in one query. You can peek in my data model and code to see how that works.
But don't go solving the problem again, just use Cascade. :) The demo site is updated to use the 2.1.0 BETA version at the moment, which isn't quite released, but adds several new features.
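One common way to implement such a sort_key is a fixed-width materialized path; this sketch is a guess at the general technique, not Cascade's actual data model (table and column names are made up):

```sql
-- each node stores the zero-padded positions of its ancestors
-- concatenated, e.g. root '001', its 2nd child '001002', and so on.
-- The whole subtree under '002' then comes back in one range scan,
-- already in tree order:
SELECT node_id, description
FROM   nodes
WHERE  sort_key LIKE '002%'
ORDER  BY sort_key;
```

Compared with walking parent_id one level per query, this answers "give me the whole subtree" in a single indexed query.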
-mark
Re: database design issue: one big table vs hundred of smaller tables
by jepri (Parson) on Nov 27, 2001 at 06:41 UTC