PerlMonks  

DBI vs MLDBM/GDBM_File, etc.

by ZZamboni (Curate)
on Jul 03, 2000 at 23:18 UTC

ZZamboni has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Although I have been programming in Perl for a number of years now, I have never used any database modules. Yes, don't laugh.

Now I have this program on which I have been working, and so far I had been using flat files to store the data. It works OK, but as the amount of data grows, access becomes slower and slower. So I started looking into more sophisticated database solutions.

I have looked at several of the traditional DB modules, such as GDBM_File, DB_File, and MLDBM (which layers on top of either of the other two). Using MLDBM would make it much easier to port my existing program, because I already store things in a hash. I'm just looking for a way to improve my program's performance in reading and writing its database.

Of course, I also looked at DBI. I even bought the book. However, it seems that most of the well-developed DBI drivers are for full-blown database servers, such as Oracle or MySQL. I need something standalone. But there don't seem to be DBDs for GDBM and friends, are there? Furthermore, I have the feeling that using DBI would involve massive rewriting, planning of the tables, and the learning curve of SQL. Maybe I'm just afraid because it's new territory for me.

So I guess I'm just looking for some advice here from the more experienced monks. Is using an "old" DB module such as MLDBM with GDBM_File considered harmful? Should I use DBI at any cost? I have very little formal database programming background, so maybe I'm just not making sense here.

Any wisdom will be much appreciated.

Thanks,

--ZZamboni

Replies are listed 'Best First'.
Re: DBI vs MLDBM/GDBM_File
by eduardo (Curate) on Jul 04, 2000 at 01:14 UTC
    I'll give you the same advice I give everyone in a situation similar to yours... make the program fit the data, not the other way around. Things like MLDBM are extensions of a system that was not designed to handle data more complex than a simple key => value pair. RDBMSes were designed to handle, you guessed it, relational data!

    The power of an RDBMS comes from the ability to normalize your data into units that make it easy to manage, and easy to query and join in ways that would have been computationally expensive, or sometimes even intractable, in a traditional storage solution. The power of an RDBMS is that the structure of your data is transparent to you; it is its context and purpose that matter to you. Order doesn't really exist in an RDBMS outside of the order in which you wish to have the data described to you when you extract it. The power of an RDBMS becomes obvious in situations such as this:

    supplier_table
    id   supplier
    1    foo
    2    bar
    3    baz

    item_table
    id   item
    1    screw
    2    bolt
    3    stereo

    supplier_item_xref
    supplier_id   item_id
    1             1
    1             3
    2             2
    3             1
    3             2
    3             3

    and let's say you want to see all of the suppliers that happen to supply the items 'screw' or 'stereo'. It would be as easy as saying:

    select supplier_table.supplier
    from supplier_table, item_table, supplier_item_xref
    where supplier_table.id = supplier_item_xref.supplier_id
      and supplier_item_xref.item_id = item_table.id
      and (item_table.item = 'screw' OR item_table.item = 'stereo');
    This would return to you foo and baz... and all of the processing, the joining, the finding of items, the finding of correlations would be handled by the database engine, instead of you having to do it by hand. Usually a database engine will have been optimized MUCH more for general purposes than you or I could manage, so it will be able to do this EXTREMELY effectively if you code your queries halfway intelligently.

    So now you are asking, why on earth would I use something other than an RDBMS? Well, because oftentimes your data is NOT relational; the different correlations between different atomic aspects of the data may or may not interest you! You may really just have a record that can be accessed by a key, and that aspect of the program specification is guaranteed not to change. In cases like this, I go by an old engineer's adage: minimize the number of moving parts! If a system's data does not require the complexity inherent in the creation of a relational model, then it is pointless to bring it in!

    So, the questions you have to ask yourself are: what type of data do I need to store, how does my data need to be retrieved, how large will my data set grow? If the answers to these questions point to a simpler answer, then something like a DBM will serve you well, serve you with little overhead, and with minimal added complexity. If, however, you see that the system requirements will someday need more advanced access to the data, it is time to invest in learning DBI, SQL, and RDBMS concepts.
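    The cross-reference example above can be run end-to-end through DBI. This is only a sketch: it assumes the DBD::SQLite driver is available (any DBI driver would work the same way with its own DSN); the table names and rows are taken from the post, everything else is illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory database; swap the DSN for your own driver as needed.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

$dbh->do("CREATE TABLE supplier_table (id INTEGER, supplier TEXT)");
$dbh->do("CREATE TABLE item_table (id INTEGER, item TEXT)");
$dbh->do("CREATE TABLE supplier_item_xref (supplier_id INTEGER, item_id INTEGER)");

$dbh->do("INSERT INTO supplier_table VALUES (?, ?)", undef, @$_)
    for [1, 'foo'], [2, 'bar'], [3, 'baz'];
$dbh->do("INSERT INTO item_table VALUES (?, ?)", undef, @$_)
    for [1, 'screw'], [2, 'bolt'], [3, 'stereo'];
$dbh->do("INSERT INTO supplier_item_xref VALUES (?, ?)", undef, @$_)
    for [1,1], [1,3], [2,2], [3,1], [3,2], [3,3];

# The join from the post; the engine does all the correlating.
my $rows = $dbh->selectcol_arrayref(q{
    SELECT DISTINCT st.supplier
    FROM supplier_table st
    JOIN supplier_item_xref six ON st.id = six.supplier_id
    JOIN item_table it          ON six.item_id = it.id
    WHERE it.item IN ('screw', 'stereo')
    ORDER BY st.supplier
});
print "$_\n" for @$rows;   # baz, foo
```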

      But this is a religious issue with me. As long as the RDBMS supports it, you should use ANSI 92 join syntax. And you should use table aliasing as well. So your query ought to be written as
      select st.supplier
      from supplier_table as st
      join supplier_item_xref as six on st.id = six.supplier_id
      join item_table as it on six.item_id = it.id
      where it.item in ('screw', 'stereo');
      This allows the SQL optimizer to do a better job and can result in faster code for complex joins. If your RDBMS doesn't support this, upgrade. It also makes your code more readable. --Shoeboy
      I'm not wearing any pants

        You're right, of course (about the join syntax - I really can't comment on the pants!), but there are a number of old-timers like me who have been using the older syntax for far too long and will take some time getting used to these new-fangled ideas :-)

        --
        <http://www.dave.org.uk>

        European Perl Conference - Sept 22/24 2000
        <http://www.yapc.org/Europe/>
Re: DBI vs MLDBM/GDBM_File
by infoninja (Friar) on Jul 03, 2000 at 23:27 UTC
    If you decide you want to go with DBI, you should probably look at DBD::RAM; it can use in-memory tables or files, so you wouldn't need to install a database server. As for DBI vs. one of the traditional DB modules, I'm biased towards DBI, but that's mainly because the vast majority of the work that I do to pay the bills is SQL-related.
Re: DBI vs MLDBM/GDBM_File
by cwest (Friar) on Jul 03, 2000 at 23:21 UTC
    You could try Text::CSV_XS, the data files are comma delimited and you have some built in support for adding and retrieving rows... and a few other things.

    I would personally go for the MySQL route, especially since they went GPL!

    
    
    --
    Casey
    
      I have seen Text::CSV_XS, and I also know of the existence of DBD::CSV. But the problem is that a CSV file is still a flat text file, with no direct indexing capabilities. And that's precisely what I want: to avoid having to rewrite the whole file every time I make a change.

      I'd like to avoid MySQL because you need to have the server running. I would like to have something standalone.

      Thanks,

      --ZZamboni

        If you want something standalone, try DBD::RAM. You can still have the SQL syntax and whatnot without the server. However, you then do not get the benefits of a server, such as threading, indexing, locking (heaven forbid someone changes your flat files), control over who accesses your data, etc... Out of curiosity, why do you want it standalone? Because you don't want to muck with a server, or a design reason of some sort?

        Cheers,
        KM

Re: DBI vs MLDBM/GDBM_File
by PipTigger (Hermit) on Jul 04, 2000 at 10:19 UTC
    I had a similar problem 2 years ago. My solution was to cook up a weird disk-based hashing method. Each of my records had a primary identifier (a unique name) which I converted to ASCII values and summed. I then multiplied by a really large prime number (for fun and further diversity), added the ASCII values of the first and last characters again, and did modulo division by 65536. The result for each record was a quickly computed, nearly random bucket. I converted the remainder to hex so I'd get filenames between 00/00.txt .. FF/FF.txt ... up to 256 directories, each with up to 256 text files. I added many hundreds of thousands of records and it was still wonderfully fast, but it would have been horrendous if they'd all resided in a single file. If you don't have a unique key for each record, this solution would not be helpful, but it worked really well for me back in the day.
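    A hypothetical sketch of the bucketing scheme described above (the exact prime and key are illustrative; the original post doesn't give them):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the character values of the key, mix in a large prime,
# add the first and last characters again, and reduce modulo
# 65536 to pick one of 256*256 bucket files.
sub bucket_path {
    my ($key) = @_;
    my $sum = 0;
    $sum += ord($_) for split //, $key;
    $sum *= 99991;   # an arbitrary "really large" prime
    $sum += ord(substr $key, 0, 1) + ord(substr $key, -1, 1);
    my $bucket = $sum % 65536;
    # High byte names the directory, low byte names the file.
    return sprintf "%02X/%02X.txt", $bucket >> 8, $bucket & 0xFF;
}

print bucket_path("some_unique_name"), "\n";   # something like "3F/A2.txt"
```

    The same key always lands in the same small file, so lookups touch one file of a few hundred records instead of one file of a few hundred thousand.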

    Since then, I've picked up the O'Reilly MySQL & mSQL book and have learned DBI. If you have permissions to set up a daemon on a server, it's really pretty painless. I've compiled MySQL a couple of times now and installed 3 or 4 different versions (at different times) without too much headache. SQL is like Perl in that it is a rich language, but it's okay not to know everything immediately. There are a few key things that can be learned quickly, and you can get by with those... the rest just makes you better. SQL is easy enough, and DBI seems to provide the portability and flexibility to plug in differing database engines and drivers rather simply. I've been won over to relational databases since they are really powerful and they're not as hard as I thought they'd be to use (maybe MySQL is a cakewalk compared to Oracle, but it's a start). I hope this helps somewhat. I'd be glad to share more of my flatfile madness if you'd like. Shalom.

    -PipTigger

    p.s. Initiate Nail Removal Immediately!

Re: DBI vs MLDBM/GDBM_File
by KM (Priest) on Jul 03, 2000 at 23:27 UTC
    There isn't anything inherently wrong with using Berkeley DB and the various DBM modules. Personally, I would also go the MySQL or Postgres route. Not only should it help the speed of your program, but once you have everything set up for DBI, you should be much happier.

    Cheers,
    KM

Re: DBI vs MLDBM/GDBM_File
by Anonymous Monk on Jul 05, 2000 at 16:27 UTC
    Crikey - reading those replies made my eyes hurt.

    Points:
    1. GDBM_File is *really* easy. If you're just using a glorified hash, you just tie() it to a disk file and it behaves exactly like a normal hash. Large-ish files (in comparison to flat files) but simple, fast and hacky.

    2. SQL is also really easy. Learning more about relational databases will look good on your CV. MySQL is free, but unless you host the site you'll need to find a host who runs it. If it's a standalone program, then you require the users to install MySQL and the DBD/DBI interfaces, then run a setup script to check the db is set up OK, install the tables, etc.
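    The tie() approach from point 1 really is a one-liner plus error check; a minimal sketch, assuming GDBM_File is built into your Perl (the file name is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use GDBM_File;

# After the tie, reads and writes on %db go straight to disk,
# but the code looks exactly like a normal hash.
tie my %db, 'GDBM_File', 'data.gdbm', &GDBM_WRCREAT, 0640
    or die "Cannot tie data.gdbm: $!";

$db{answer} = 42;          # stored in the file immediately
print "$db{answer}\n";     # read back from the file

untie %db;
unlink 'data.gdbm';        # cleanup for this demo only
```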

    My advice? Depends on how much time you have. If you have the time and desire to learn some SQL, do it - it crops up everywhere these days, and it really is easy as (insert easy thing here). Otherwise, just use GDBM_File. It sounds like it will be painless to implement, and it is stupendously easy.

    One more point - ignore that complicated JOIN stuff above, just keep it simple like the first example and your program will work. Yes, it can be optimised further, but hey, it's easy to maintain, and the difference is bound to be negligible. The first example is readable to someone who doesn't know SQL, the second isn't. Go figure.

    Adam
    adam@sixzeds.com

Re: DBI vs MLDBM/GDBM_File
by Jonathan (Curate) on Jul 04, 2000 at 13:55 UTC
    No one's asked you the most important questions.
    1. What exactly are your storage requirements?
    2. How complex are your data entities?
    3. Do you have many users? If so, what sort of queries/updates will they be doing?


    I've been using SQL for over 10 years now. I can't say I like it. Discussing the merits of ANSI SQL (not many!) is irrelevant at this stage.
    BTW, why do people forget that the optimiser is the center of an RDBMS? If it is a poor one (are you listening, Oracle and Sybase? :), following ANSI can be the worst thing to do. Granted, I haven't used Oracle for a couple of years, but the last time I did, the query optimizer was given up on and we reverted to rule-based queries.
    Another problem with some RDBMS's is that they require almost constant DBA work to keep them going.

    Give careful thought to your data requirements. Bear in mind that an RDBMS is geared to larger data loads. If your requirements are fairly small, then the DBM modules would probably give better performance.

    gl
Re: DBI vs MLDBM/GDBM_File
by Shoeboy (Sexton) on Jul 03, 2000 at 23:49 UTC
    OK, going with a full-on RDBMS will probably be overkill. Going with a quick and dirty db like MySQL or MS Access is fine, but you'll have to pick up some SQL (not hard). Going with DBM is fine for simple data. If you've got a very large data set, you'll want a full RDBMS for the data caching and rich indexing options.

    My advice is: do what will get the job done now, then pick an RDBMS and learn it once you can afford to spend some time clambering up the learning curve. I use Win32::OLE (ADO) and MS SQL at work. At home, I try not to have data that won't fit in main memory.

    --Shoeboy
    I'm not wearing any pants
Re: DBI vs MLDBM/GDBM_File
by takshaka (Friar) on Jul 04, 2000 at 09:08 UTC
    I've used MLDBM with Storable over DB_File a few times. It is perfectly fine as long as you don't need to do searches through the value part of the data--for that you really want a relational solution. But as long as you're just doing key lookups, it's wonderful.

    MLDBM is not necessarily a drop-in replacement, however. The biggest "gotcha" is that you cannot assign a value directly to a non-top-level element. That is, you can't do this to an MLDBM-tied hash:

    $hash{foo}{bar}{baz} = 'stuff';

    You must instead do this:

    $tmp = $hash{foo};
    $tmp->{bar}{baz} = 'stuff';
    $hash{foo} = $tmp;
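    Put together, a minimal sketch of the setup takshaka describes (assumes MLDBM, Storable and DB_File are installed; the file name is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MLDBM qw(DB_File Storable);   # DB_File underneath, Storable serializes
use Fcntl;

tie my %hash, 'MLDBM', 'data.db', O_CREAT|O_RDWR, 0640
    or die "Cannot tie data.db: $!";

# Wrong: this modifies a throwaway copy, so nothing is stored:
#   $hash{foo}{bar}{baz} = 'stuff';

# Right: fetch the whole value, modify it, store it back.
my $tmp = $hash{foo} || {};
$tmp->{bar}{baz} = 'stuff';
$hash{foo} = $tmp;

print $hash{foo}{bar}{baz}, "\n";   # the nested value survives the round trip

untie %hash;
unlink 'data.db';                   # cleanup for this demo only
```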
The POWER of flat files
by gaggio (Friar) on Jul 04, 2000 at 20:23 UTC
    Hey, I have taken two database courses, and I had a very good introduction to the concept of relational databases (IMHO).

    I wanted to just remind all the people here how powerful a flat text file is for simple data storage. You are right, Jonathan for having asked those questions. Everything depends on what type of data ZZamboni wants to store, and how he needs to process it.

    I will add my two cents to the thread by saying that for simple storage, there is nothing better than flat files. Your flat files are growing too big? So what? Update the system so that it uses more than one file. Imagine you have 10,000 items to store. Well, why not have 10 files to store them? The first file for the first 1,000 items, and so forth. Speed-wise, I am telling you, you will end up with something a LOT faster than any other big DB package or wrapper like all the DBI stuff. Because those packages are, in fact, also using some big files, I guess...
    So... how complex is your data? How do you want to access it? Do you need any relational concepts? If the only concern you have is the speed, then my advice would be Keep the flat file system - Just improve it a little!.
      Gaggio,

      We all know that there is a right tool for the right job, and flat files will be of great help for storing small amounts of data. But seriously, after having worked for 2 years ONLY with flatfiles, I would never, EVER, recommend the approach just mentioned above to anyone. There is no telling how much "just improving a little" your data will require in the future, and that could bring you down to your knees with maintenance in a short while.

      BTW: there is NO WAY that you will get faster access from a flatfile (or series of them) than from a real DBMS. Expensive DBMS's like Oracle, Informix, DB2, et al will even keep most accessed data in RAM to avoid going to disk for frequent queries. Sure, you can do that with perl too, but you're not talking about 'just' flatfiles anymore.

      Please reconsider your advice, there are tools and tools for each job, and I don't think flatfiles will keep you running as smoothly as you think.

      #!/home/bbq/bin/perl
      # Trust no1!
        You are right, BBQ, when you say that there are tools and tools for each job. Don't make me say what I did not say. I said that flat files are fast depending on the use you make of them.

        I should also have added that I am still a student, and I am not the administrator of a University website. In my case, I think that flat files are the solution, compared to huge-pain-in-the-ass-to-install DBMS systems. I never said that MySQL was not fast. This is right, caching makes the overall performance acceptable.

        But again, I am saying that the easiest solution for ZZamboni might be to keep the flat file format. *Might*, because he did not say everything about what kind of data it is, and what use he wants to make out of it.

        Father Gaggio
      Since everyone is coming down on flat files, I figure it's about time for me to come to their defense.

      Relational databases are great for most types of data storage, particularly when the data will be accessed randomly, when you need something that will maintain atomic transactions and referential integrity for you, and when you want the physical data structure abstracted for you. However, there are several occasions when you can't beat the speed and convenience of flat files:

      1. data files that only need to be appended to and are read infrequently or sequentially. It is almost always faster to append a line to a flat file than adding a row to a DB table; there is no overhead of creating index entries, etc....
      2. data files that only need to be accessed sequentially and completely, never randomly.
      Also, if your data is strongly hierarchical, you could be better off using a hierarchical datastore, such as your computer's filesystem.
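      The append-only case in point 1 takes only a few lines of Perl; a sketch (the file name and record format are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $logfile = 'events.log';   # hypothetical file name

# Point 1: appending a record is one open and one print --
# no index maintenance, no server, no schema.
sub append_record {
    my (@fields) = @_;
    open my $fh, '>>', $logfile or die "Cannot append to $logfile: $!";
    print $fh join("\t", @fields), "\n";
    close $fh;
}

append_record(time(), 'login', 'zzamboni');

# Point 2: reading back sequentially and completely.
open my $fh, '<', $logfile or die "Cannot read $logfile: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($when, $what, $who) = split /\t/, $line;
    print "$what by $who\n";
}
close $fh;
unlink $logfile;   # cleanup for this demo only
```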
      Actually, flat files are not faster for large numbers of records... beyond that, as the data is pulled into your program you have to take extra measures to tie it all together, whereas an RDBMS will already have all of the information related for you.

      Not to mention the fact that a lot of things that you would have to do programmatically with flat files you can do with the DB and its built-in methods.
      • Sort the list? Use ORDER BY
      • Get only unique values? Use DISTINCT
      • Need to tie two tables together? Use JOIN
      • Need transaction support? Use PostgreSQL or SQL Server or Oracle and you have it.
      • etc...
      There are many benefits that come from using an RDBMS, and if ZZamboni wants his site to grow more easily, then he should port to one. The investment is worth the payoff later, and he gets to learn a whole new side of Perl he has not explored yet.
