http://qs321.pair.com?node_id=57686

DarkBlue has asked for the wisdom of the Perl Monks concerning the following question:

I have written a database with Perl 5.6 using GDBM_File. The records could be quite large, as I have built no limits into the database on either record size or number of records.
The database is accessed with GDBM_File as a tied hash. It runs on a web server via CGI.
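For context, the tie itself looks something like this (file name and permissions simplified):

    use GDBM_File;

    # open (or create) the database file and bind it to %db
    tie my %db, 'GDBM_File', 'records.gdbm', &GDBM_WRCREAT, 0640
        or die "Cannot tie records.gdbm: $!";

    $db{$key} = $record;    # each store goes straight to the file
    untie %db;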

Questions:
1) Are there any good reasons why I should limit the size and/or number of records? (Obviously smaller records would be quicker to download, but these are text only, so even quite large records are manageable.)
2) When using tied hashes, is the entire database file loaded into memory each time I access it, or does Perl load only the record(s) being used at that particular time? My main concern is that if the former applies, then as the database grows larger and multiple users access it simultaneously, I can see the web server quickly running out of memory (with, I imagine, disastrous effects). Furthermore, using the database would become painfully slow once the volume of records is large.

As it stands, it's running very quickly (with instantaneous response), but it's only being used by a small workgroup and contains fewer than a hundred records. We estimate that between ten and twenty new records could be added each day. We also plan to make the database accessible outside of our workgroup (to anyone with a web browser); we have no idea how many simultaneous users it might then need to serve, but we're predicting about a thousand users a day (users, not accesses).
Before I proceed any further with the database in this form, I need to know that it's not going to fall over or become unusable due to a lack of speed.
I know I should probably use Oracle or SQL (et al), but I'm unfamiliar with both products and would rather stick with what I know at the moment (we can migrate to something more "industrial" once our concept is proven - should that prove necessary).
Any help would be appreciated.

Cheers.

Re (tilly) 1: Databases and tied hashes.
by tilly (Archbishop) on Feb 11, 2001 at 07:32 UTC
    This is something I had a serious interest in at one point, so you are going to get a longer than expected answer... :-)

    In answer to 1, whether there is a size limit depends on the database format. GDBM and DB_File are both essentially unlimited. (On 32-bit platforms your main limit is the fact that pointers are only able to address 2 GB of data. If you have large file support, I don't know what GDBM can do, but DB_File is OK out to hundreds of terabytes of data, disk permitting.)

    In answer to 2, the entire database is not loaded into memory. However, hashing algorithms have very poor data locality by design, so for very large data sets you will have to hit disk for every request. If you anticipate hundreds of megs or more of data, you may wish to switch to DB_File so you can use a BTree, which has great locality of reference. (In practice, with very large active data sets, it touches disk less than once per request.)
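    A BTree is just a different flag at tie time; the tied-hash interface does not change. A minimal sketch (the file name is made up):

        use DB_File;
        use Fcntl;

        # $DB_BTREE instead of the default $DB_HASH gives ordered keys
        # and much better locality of reference on disk
        tie my %db, 'DB_File', 'records.db', O_CREAT|O_RDWR, 0640, $DB_BTREE
            or die "Cannot tie records.db: $!";

        $db{$key} = $value;    # identical usage, different on-disk layout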

    Beyond that, there are more issues. First of all, if you are editing the data set from 2 CGI scripts at once, you can easily get data corruption. The odds of this are fairly low, but by no means non-existent. The documentation and the Perl Cookbook suggest that you may safely lock the filehandle to the database, and offer sample code. At least for DB_File this is not true. Instead you must lock a sentinel file. I offer some reasonable code for this at Simple Locking. (You would want to lock attempts to access for write, not for read.) Alternately you can use the newer BerkeleyDB module with a recent Berkeley DB to get native transactional support. After detailed discussions with Sleepycat on the needs of CGI scripts, I cannot recommend this for critical data, as there are some non-recoverable disasters.
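    The sentinel idea in miniature (a sketch, not the code from Simple Locking; readers would take LOCK_SH, writers LOCK_EX):

        use DB_File;
        use Fcntl qw(:DEFAULT :flock);

        # lock a separate sentinel file BEFORE touching the database
        open my $sentinel, '>', 'records.lock' or die "open lock: $!";
        flock $sentinel, LOCK_EX or die "flock: $!";    # exclusive, for a write

        tie my %db, 'DB_File', 'records.db', O_CREAT|O_RDWR, 0640, $DB_BTREE
            or die "tie: $!";
        $db{$key} = $new_value;
        untie %db;              # flush the write before releasing the lock

        close $sentinel;        # releases the lock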

    Secondly, there are a variety of conditions that can result in data corruption (killing scripts at the wrong time, hardware failure, etc.), and therefore you should not only have regular, tested backups, but you should also keep backup data in a plain-text format. (This is generally good advice.)
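    Dumping to plain text can be as simple as walking the tied hash. A sketch, assuming keys and values contain no tabs or newlines:

        use GDBM_File;

        tie my %db, 'GDBM_File', 'records.gdbm', &GDBM_READER, 0640
            or die "tie: $!";
        open my $out, '>', 'backup.txt' or die "open: $!";
        while (my ($key, $value) = each %db) {
            print {$out} "$key\t$value\n";
        }
        close $out;
        untie %db;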

    Thirdly, read your database documentation. Frequently it will have notes about the file having to be stored locally, not over NFS. Take this seriously. If you wish to have a dbm read-write accessible from multiple machines, you will want to produce a dedicated server running on top of it and connect to that server from the various machines. (This is what traditional relational databases do. It is slower and increases latency, but it will allow you to scale much better.)

    Fourth, note that SQL databases and dbm databases are very different in design and concept, and the usage patterns that work with them are very different. While an application may be ported from one dbm to another, or one SQL database to another, porting from an SQL database to a dbm or vice versa is a different kettle of fish. As you have found out, dbms have a very simple data model that maps very well to how Perl works. SQL databases have a much more complex model and derive a great deal of their power from their ability to figure out how to connect information on the fly in interesting ways.

    In general a relational database can handle much more complex data models than a dbm database can, but there is a bit of a learning curve in using one. And it is fairly common to see an SQL database being used in a pattern that is more appropriate for a dbm, or the grown-up relative of a dbm, an OO database.

    As an internal note, dbms and relational databases use the exact same techniques to do fast accesses of data. Indeed, the Sleepycat folks (who maintain Berkeley DB, which DB_File accesses) are, last I heard, seriously thinking of adding an optional SQL front end on top. What this would mean is that a collection of dbms would be organized and accessible in a relational manner.

Re: Databases and tied hashes.
by DarkBlue (Sexton) on Feb 11, 2001 at 19:53 UTC
    Summarising then, from the various replies I have received (with thanks):

    1) Whilst in the concept stage, I am more than adequately served by GDBM_File for a relatively simple database structure

    2) As I do not require a relational database for this product, I can effectively ignore SQL, Oracle, Postgres, etc., as the demands of my database (even with the most optimistic predictions) simply don't justify the capabilities of those systems

    3) To "industrialise" my product, BerkeleyDB is easily capable of serving my database ("multi-threaded concurrent read/write, hundreds of terabytes of data, 30,000+ accesses per second"). The server should have large file support in order to take full advantage of BerkeleyDB. Therefore I can keep the entire product non-proprietary and open-source

    4) BerkeleyDB will allow me to use BTree as and when required (for reduced disk access during usage)

    5) I should probably implement file-locking on the database for any write operation to maintain data integrity (despite the slight latency this will introduce)

    6) The database should be backed up regularly (which goes without saying) and a plain-text backup is also probably a good idea

    7) The database should run on its own dedicated server, not on a server that's also having to serve web pages and other CGI

    8) When using a tied-hash/GDBM_File database model, the entire database IS NOT loaded into memory for every single read or write operation (or thread), but there will be disk accesses for every such operation. Therefore, I'd conclude that the server needed to host this application would probably be okay with between 256 MB and 1 GB of RAM. The processor should be as fast as possible. The disk sub-system should be as fast as possible. RAID (what level?) would be best, as the fault-tolerance would benefit the "industrialised" version of this database.

    Does anyone disagree with any points in this summary?

    Thanks again for your speedy replies and the invaluable insight you have given me.

    Jonathan M. Hollin
    Digital-Word.com

      Some clarifications on those points.
      1. If your final version won't need an SQL database, then a dbm database is fine for the concept stage.
      2. What separates the need for a relational database from a dbm is your data model. If you are starting to get into relationships and correlations between data (e.g. taking sales figures and getting reports of sales by customer, by product, etc.), then you clearly want a relational database. If you want a simple lookup, then a dbm is just fine.
      3. Berkeley DB is indeed an industrial strength database. It is particularly well suited to situations which need very high performance for simple tasks. (It is also great for embedded use, but I digress.) The bottlenecks that you will hit first have to do with the CGI model.
      4. Yes (note that the BTree support comes from Berkeley DB rather than GDBM). The wins of BTrees here are that they keep data in order (hashes do not) and that they get better locality of reference (a very organized access pattern). If your data fits in memory, hashes are generally faster; if it does not, BTrees generally win.
      5. Yes. In high performance read-write situations, locking is important and how it is done is going to be your bottleneck. Most web applications are write seldom, read many times.
      6. Yes. Backup. And don't expect that binary data formats will be portable from machine to machine.
      7. If you want a website to scale, definitely. It is much easier to balance a load across 5 webservers than keep 5 databases in sync. However if you are anticipating this need, using a dbm solution will likely involve some custom work. Relational databases all have the data access segregated into its own process so the database can be moved to another machine. dbms traditionally do not.
      8. I think you are dramatically overestimating the needed resources.
Re: Databases and tied hashes.
by Trimbach (Curate) on Feb 11, 2001 at 07:40 UTC
    I've set up a CGI/GDBM database combo that currently has about 4,500 records, with 15-25 added each day. The traffic is a lot less than you're expecting (I get about 50 accesses per day), but the system works very well with no perceptible loss of performance.

    I'll bet that if you're not adding a ton of records you might be OK with GDBM. Even if the db is locked for reading (does anyone know if GDBM does this?), the access is quick enough that even under the kind of load you're talking about you'll probably still be OK.

    However, I have recently (in the past few months) graduated to using DBI and MySQL, and I have to say that the learning curve really isn't that steep, and the "real" dbs make lots of things much easier to do than the tied-hash dbs. I always hated rolling my own routines to split and join records to do inserts and updates; it's so much easier in MySQL. If you have MySQL (or similar) available to you, you won't regret moving your project over... and you'll probably get to trim a lot of your code in the process.
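    For the curious, the hand-rolled packing I mean looks something like this (this assumes the delimiter never appears in the data):

        # one record = several fields joined on a delimiter
        my $SEP = "\x1f";    # ASCII unit separator

        $db{$id} = join $SEP, $name, $email, $phone;               # "insert"
        my ($name2, $email2, $phone2) = split /$SEP/, $db{$id};    # "select"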

    Gary Blackburn
    Trained Killer

      If you are using complex records, then I agree that MySQL is better. But there are still some uses where I would just pull out MLDBM, which eliminates all of your need to think about splitting fields yourself.
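      MLDBM serializes whole Perl structures in and out of the dbm for you. A sketch (the serializer choice and field names are just examples):

          use GDBM_File;
          use MLDBM qw(GDBM_File Storable);   # GDBM underneath, Storable to serialize

          tie my %db, 'MLDBM', 'records.db', &GDBM_WRCREAT, 0640 or die $!;

          $db{42} = { name => 'Ada', email => 'ada@example.com' };   # store a hashref

          # fetch, modify, store back -- nested values cannot be
          # changed in place through the tie
          my $rec = $db{42};
          $rec->{phone} = '555-0199';
          $db{42} = $rec;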
Re: Databases and tied hashes.
by AgentM (Curate) on Feb 11, 2001 at 05:54 UTC
    Definitely go with a real SQL server like MySQL. For your intentions, this file-based DB is simply not extensible enough. The standard DBFile engines follow a pretty similar routine: tie the file in, then read and write hashes, loading them on demand with zero amortization. That means that while they are simple to set up and use, their usefulness really burns out at more than a few users, especially since no more than one user can access the DBFile at a time! (As far as I know, none of them implement record locking.) I see this as a case of "I see no choice but to upgrade my existing systems to something that will be scalable enough to handle the load." Good luck!

    Update: Yes, tilly. Very good, tilly. Next time read the main node and make sure that what you say is relevant, e.g. à la GDBM.

    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
      Please stop spreading FUD. Something doesn't have to have a SQL interface to be powerful. Berkeley DB supports applications with multi-threaded concurrent read-write access to databases of hundreds of terabytes at rates of over 30,000 accesses per second. Much of this native functionality is not available through the DB_File interface, but it is through BerkeleyDB.
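      For instance, the BerkeleyDB module can open a Concurrent Data Store environment so the library does the read/write locking for you. A sketch (the home directory and file name are made up):

          use BerkeleyDB;

          my $env = BerkeleyDB::Env->new(
              -Home  => '/var/db/myapp',
              -Flags => DB_CREATE | DB_INIT_MPOOL | DB_INIT_CDB,   # CDS: many readers, one writer
          ) or die "Env: $BerkeleyDB::Error";

          my $db = BerkeleyDB::Hash->new(
              -Filename => 'records.db',
              -Flags    => DB_CREATE,
              -Env      => $env,
          ) or die "DB: $BerkeleyDB::Error";

          $db->db_put($key, $value);   # the library locks around this
          my $fetched;
          $db->db_get($key, $fetched);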

      Now if you care about the complexity of your data model, then MySQL wins. If you care about preserving your data, in my professional opinion a good dbm used properly is as good or even a better choice than MySQL. If you want both, then you can use Postgres, Oracle, etc. But while I admit that the tied Perl interface is not the snappiest (though as some have discovered, it is pretty good), I do not believe that for straight lookups you are going to find any relational database that will beat a C application using Berkeley DB. It ain't going to happen, and it ain't going to happen for a lot of reasons.

      So unless you know something about scalability that I, Sendmail, Cisco, Ask Jeeves et al don't, please don't fall into the trap of thinking that dbms are just lightweight toys.

      For more information I suggest you visit Sleepycat Software.

Re: Databases and tied hashes.
by OeufMayo (Curate) on Feb 11, 2001 at 05:59 UTC

    I've never worked with the GDBM_File module, but one solution, if you're aiming for flexibility, is to use the DBI module. It provides a common interface to every kind of database (well, nearly), ranging from plain text files to Oracle.

    You can choose to use any database, and should you change it over time you will only have to modify a couple of lines in your program.

    You don't really need to be an expert in SQL to insert and query records in your database (but it sure can help), as the DBI documentation provides tons of examples of how to do these kinds of operations. You might also be interested in Programming the Perl DBI, a really good book on database programming with Perl.
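    A minimal DBI session looks something like this (the driver, table, and column names are only examples):

        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                               { RaiseError => 1 });

        my $sth = $dbh->prepare('SELECT title, body FROM records WHERE id = ?');
        $sth->execute($record_id);

        while (my ($title, $body) = $sth->fetchrow_array) {
            print "$title: $body\n";
        }

        $dbh->disconnect;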

    --
    PerlMonger::Paris(http => 'paris.pm.org');
Re: Databases and tied hashes.
by chorg (Monk) on Feb 11, 2001 at 23:26 UTC
    All good points, but why even worry about it? Grab MySQL, take 20 minutes to install it, use the DBI, and your system will be good up to about 1-2 million records (use Postgres for more), plus no locking worries about your data. It'll take the same amount of time as installing the latest Berkeley DB. You're talking about expansion, and using a web server to handle access to your data; don't forget about generating reports based on the nature of your data, adding and deleting records, type safety, etc. If you are unfamiliar with SQL, the time to get acquainted is now, as opposed to when there are 200,000 records in your dbm.
    _______________________________________________
    "Intelligence is a tool used achieve goals, however goals are not always chosen wisely..."
      Using MySQL at work, I just want to warn you about some weaknesses:

      Error logging. It's especially difficult to know what went wrong during an import.
      Advanced SQL commands. Some useful features, like sub-selects, are lacking in my version (though mine is not the latest).
      Crashes. Unexpected, rare crashes of the server occur sometimes, with no data loss/corruption; just restart the server and it's OK (I've read on the net that other sites have seen similar behaviour).

      BUT

      MySQL is amazingly FAST and easy to set up! (Nothing compared to Oracle or even Postgres...)
      The Perl DBI interface to MySQL is simple and powerful. (BTW, shouldn't we state that any sentence with both "Perl" and "easy" in it is a pleonasm? ;-)
        I hear you about the lack of advanced SQL commands in the earlier versions - I was using it at work as well. When I started, I did not miss (or know about) subselects and the like, but when I was looking at a 7-line query to make a reports page, I realized that I really needed all those features - hence PostgreSQL... I still love MySQL though. I'll check the newest version - if it has the features I need then I'm comin' home!!!
        _______________________________________________
        "Intelligence is a tool used achieve goals, however goals are not always chosen wisely..."