http://qs321.pair.com?node_id=57696


in reply to Databases and tied hashes.

This is something I had a serious interest in at one point, so you are going to get a longer than expected answer... :-)

In answer to 1, whether there is a size limit depends on the database format. GDBM and DB_File are both essentially unlimited. (On 32-bit platforms without large file support your main limit is that file offsets can only address 2 GB of data. With large file support I don't know what GDBM can do, but DB_File is OK out to hundreds of terabytes of data, disk permitting.)
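
For concreteness, here is a minimal sketch of tying a hash to a DB_File database; the file name data.db is just a placeholder:

    use strict;
    use warnings;
    use Fcntl;      # for the O_* flags
    use DB_File;    # exports $DB_HASH, $DB_BTREE, etc.

    # Tie a hash to a Berkeley DB hash file on disk.
    my %data;
    tie %data, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie data.db: $!";

    $data{colour} = 'blue';        # writes go straight to the file
    print "$data{colour}\n";       # reads come back from the file

    untie %data;                   # flush and close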

In answer to 2, the entire database is not loaded into memory. However, hashing algorithms by design have very poor data locality, so for very large data sets you will have to hit disk for every request. Therefore if you anticipate hundreds of megabytes or more of data, you may wish to switch to DB_File so you can use a BTree, which has excellent locality of reference; in practice, with very large active data sets it touches disk less than once per request.
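
The BTree variant is only a one-argument change to the tie; a sketch, again with an invented file name:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Same tied-hash interface, but the data is stored as a BTree, which
    # keeps keys sorted and close together on disk -- hence the much
    # better locality of reference on large data sets.
    my %big;
    tie %big, 'DB_File', 'big.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot tie big.db: $!";

    # A side benefit of the BTree: iteration comes back in sorted key order.
    while (my ($key, $value) = each %big) {
        print "$key => $value\n";
    }

    untie %big;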

Beyond that, there are more issues. First of all, if you are editing the data set from two CGI scripts at once, you can easily get data corruption. The odds of this are fairly low, but by no means non-existent. The documentation and the Perl Cookbook suggest that you may safely lock the filehandle to the database, and offer sample code. At least for DB_File this is not true. Instead you must lock a sentinel file. I offer some reasonable code for this at Simple Locking, and a rough sketch of the idea follows below. (You would want to lock attempts to access for write, not for read.) Alternatively, you can use the newer BerkeleyDB module with a recent Berkeley DB to get native transactional support. After detailed discussions with Sleepycat on the needs of CGI scripts, I cannot recommend this for critical data, as there are some non-recoverable disasters.
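
This is not the code from that node, just a minimal sketch of the sentinel-file idea: take an exclusive lock on a separate file before tying the database for a write, and do not release it until after the untie (the file names are placeholders).

    use strict;
    use warnings;
    use Fcntl qw(:flock :DEFAULT);
    use DB_File;

    # Lock the sentinel file (data.db.lock is an arbitrary name),
    # never the database file itself.
    open my $lock, '>', 'data.db.lock' or die "Cannot open lock file: $!";
    flock($lock, LOCK_EX) or die "Cannot get exclusive lock: $!";

    my %data;
    tie %data, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie data.db: $!";

    $data{counter} = ($data{counter} || 0) + 1;   # the actual write

    untie %data;     # make sure everything is flushed to disk first
    close $lock;     # only now give up the lock

The ordering matters: the untie happens before the lock is released, so no other process sees the file before your changes have been flushed.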

Secondly, there are a variety of conditions that can result in data corruption (killing scripts at the wrong time, hardware failure, etc.), so you should not only have regular, tested backups, but also keep backup data in a plain-text format. (This is generally good advice.)
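
One simple way to produce such a plain-text backup, assuming keys and values contain no tabs or newlines (file names again are examples only):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Dump the database to a tab-separated text file.  If the dbm file
    # is ever corrupted, this dump can be reloaded with a few lines of code.
    my %data;
    tie %data, 'DB_File', 'data.db', O_RDONLY, 0644, $DB_HASH
        or die "Cannot tie data.db: $!";

    open my $out, '>', 'data.backup.txt' or die "Cannot write backup: $!";
    while (my ($key, $value) = each %data) {
        print $out "$key\t$value\n";
    }
    close $out or die "Cannot close backup: $!";
    untie %data;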

Thirdly, read your database documentation. Frequently it will warn that the database file must be stored locally, not over NFS. Take this seriously. If you wish to have a dbm read-write accessible from multiple machines, you will want to put a dedicated server in front of it and have the various machines connect to that server. (This is what traditional relational databases do. It is slower. It increases latency. But it will allow you to scale much better.)
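
As a toy illustration of the dedicated-server idea (port 9000 and the wire protocol are invented for this sketch; a real server would add locking, timeouts, concurrency and access control):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;
    use IO::Socket::INET;

    # A tiny line-oriented server in front of a local dbm file.  Clients
    # on other machines send "get KEY" or "set KEY VALUE"; all disk
    # access happens here, on the machine that owns the file.  This
    # version handles one client at a time.
    my %data;
    tie %data, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie data.db: $!";

    my $server = IO::Socket::INET->new(
        LocalPort => 9000,
        Listen    => 5,
        Reuse     => 1,
    ) or die "Cannot listen on port 9000: $!";

    while (my $client = $server->accept) {
        while (my $line = <$client>) {
            chomp $line;
            if ($line =~ /^get (\S+)$/) {
                my $value = defined $data{$1} ? $data{$1} : '';
                print $client "$value\n";
            }
            elsif ($line =~ /^set (\S+) (.*)$/) {
                $data{$1} = $2;
                print $client "ok\n";
            }
            else {
                print $client "error\n";
            }
        }
        close $client;
    }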

Fourth, note that SQL databases and dbm databases are very different in design and concept. The usage patterns that work with them are very different. Therefore while an application may be ported from one dbm to another, or one SQL database to another, porting from a SQL database to a dbm or vice versa is in general a different kettle of fish. As you have found out, dbms have a very simple data model that maps very well onto how Perl works. SQL databases have a much more complex model and derive a great deal of their power from their ability to figure out how to connect information on the fly in interesting ways.
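
Purely as an illustration of the difference in data models (this assumes DBD::SQLite is available, and the schema here is invented just for the example):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=example.sqlite', '', '',
                           { RaiseError => 1 });

    # Two tables, recombined on the fly by the database.
    $dbh->do('CREATE TABLE IF NOT EXISTS users  (id INTEGER PRIMARY KEY, name TEXT)');
    $dbh->do('CREATE TABLE IF NOT EXISTS emails (user_id INTEGER, address TEXT)');
    $dbh->do('INSERT OR REPLACE INTO users VALUES (42, ?)', undef, 'anonymonk');
    $dbh->do('INSERT INTO emails VALUES (42, ?)', undef, 'monk@example.com');

    my ($name, $address) = $dbh->selectrow_array(
        'SELECT u.name, e.address
           FROM users u JOIN emails e ON e.user_id = u.id
          WHERE u.id = ?', undef, 42);
    print "$name <$address>\n";
    $dbh->disconnect;

    # With a dbm you would instead build the keys yourself and fetch
    # directly, e.g. $users{42} and $emails{42} on tied hashes -- simple
    # and very Perl-ish, but every relationship is wired up by hand.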

In general a relational database can handle much more complex data models than a dbm can, but there is a bit of a learning curve in using one. And it is fairly common to see a SQL database being used in a pattern that is more appropriate for a dbm, or for the grown-up relative of a dbm, an OO database.

As a note on internals, dbms and relational databases use exactly the same techniques to do fast accesses of data. Indeed, last I heard, the Sleepycat folks (who maintain Berkeley DB, which DB_File accesses) are seriously thinking of adding an optional SQL front end on top. What this would mean is that a collection of dbms would be organized and accessible in a relational manner.
