http://qs321.pair.com?node_id=93532

TrinityInfinity has asked for the wisdom of the Perl Monks concerning the following question:

I have a half perl/half database question. I have a program that uses perl to basically log items to a flat file, and each item has its own comments file to go with it so folks can store remarks as they make changes to the item.
The boss is happy with what I've produced, but wants to know the upper limit of this flat-file setup. I'm unable to use the MySQL or Oracle systems they've got here, so this is the best I can do on my own without going to the IT folks to set up the project themselves. I do wonder, though: how much can a single flat-file database handle? At what number of records, or what amount of data, would it be wise to migrate to something else like MySQL or Oracle?

Re: When is a flat file DB not enough?
by jeroenes (Priest) on Jul 03, 2001 at 19:28 UTC
    Well, it depends on a lot of parameters. A short answer is: use bogus data to test your setup. Test it to its limits, and find out where you can expect problems.

    What kind of limits can you think of?

    1. Size limits. Maybe 'more' hardware can help out. Maybe you can compress the data with zlib/gzip or the like.
    2. Speed limits. Maybe 'more' hardware as well. Maybe a migration to another DB system might help (BerkeleyDB did wonders for me).
    3. Usability limits. Think about multiple users accessing at the same time. BerkeleyDB or file-locking schemes can help.

    RDBMSes like Oracle/MySQL tend to solve problems at all three levels...
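    To do the bogus-data testing mentioned above, a throwaway script can fill a test copy of the flat file with records of whatever size you like. A sketch only; the pipe-delimited layout and field names here are invented, so substitute whatever your real file uses:

        use strict;
        use warnings;

        my $test_file = 'items_test.db';    # hypothetical test copy, not the real file
        my $records   = shift || 10_000;    # how many bogus items to generate

        open my $fh, '>', $test_file or die "Can't write $test_file: $!";
        for my $i (1 .. $records) {
            # id | name | timestamp | comment  (made-up layout)
            print {$fh} join('|', $i, "item_$i", time(), "bogus comment $i"), "\n";
        }
        close $fh or die "Can't close $test_file: $!";

    Run it with a few different sizes and see where your scripts start to hurt.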

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)

Re: When is a flat file DB not enough?
by lhoward (Vicar) on Jul 03, 2001 at 19:24 UTC
    It is a generally accepted fact that no flat-file database can support more than 5,173 items until performance starts to deteriorate rapidly. The maximum number of elements that any flat file database can support is limited to 23,987.


    Just kidding. The real answer is: it depends.

    • How large are the items in the DB?
    • In what ways are they accessed (read, written, updated)?
    • How is the flat-file DB implemented ("by hand", DBD::CSV, or something else)?
    • Are there any locking or atomicity issues to consider?

    Strictly speaking, the amount of data that a flat-file DB can handle is limited only by the maximum file size your filesystem will support. However, at some point performance will become unacceptable. About the only way to come up with a good answer is to benchmark your system: load it with various amounts of data and time some representative operations.
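    A minimal sketch of that kind of benchmark, assuming a pipe-delimited test file and treating a full scan for one item as the representative operation (the file name and layout are made up):

        use strict;
        use warnings;
        use Benchmark qw(timethese);

        my $file = 'items_test.db';    # hypothetical test file of N records

        # Worst case for a flat file: scan every line until the item turns up.
        sub lookup {
            my ($id) = @_;
            open my $fh, '<', $file or die "Can't read $file: $!";
            while (my $line = <$fh>) {
                my ($item) = split /\|/, $line;
                return $line if $item eq $id;
            }
            return;
        }

        # Repeat against files of 1_000, 10_000, 100_000 records and compare.
        timethese( 50, { 'scan lookup' => sub { lookup('5000') } } );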

      I figured the file size would be the key. This is all running on a Solaris setup that handles files far bigger than mine, but I think it can handle a couple hundred items at best before the search and editing times start to get noticeable. Thanks for the response! =)
        Don't assume Solaris is so easy with large files. Sure, it can handle them, but the I/O bandwidth may bite you!!!

        Linux is better with that kinda thing...

        Sinister greetings.
        perldoc -q $_
Re: When is a flat file DB not enough?
by thpfft (Chaplain) on Jul 03, 2001 at 19:54 UTC

    Short answer: too many variables to tell. You should be ok, though. I've got a CMS that sometimes has to read two megabytes of xml and manages it without becoming unusable. It's not the right way for me to be doing it, but it just grew, and it still works ok. I suspect that it will be issues of data integrity or structure that make you shift, not capacity.

    Anyway, if you use DBI and something like DBD::CSV to manage access to your flat files, then you'll have a ready-made layer of abstraction between the scripts and the data, in which case the ceiling matters much less: if you seem to be reaching it, you can switch to something with more headroom just by changing a line or two at the top of the script (and getting the IT people to provide a bit of access to the databases. Not in that order.)
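    Something along these lines, for example (the table and column names are invented; only the DSN would need to change for a later move to MySQL or Oracle):

        use strict;
        use warnings;
        use DBI;

        # Flat files today: DBD::CSV treats each file in f_dir as a table.
        my $dbh = DBI->connect('dbi:CSV:f_dir=/path/to/data', undef, undef,
                               { RaiseError => 1 });

        # Later, point the same code at a real RDBMS by swapping the DSN, e.g.:
        # my $dbh = DBI->connect('dbi:mysql:database=items', $user, $pass, ...);

        my $sth = $dbh->prepare('SELECT id, name FROM items WHERE id = ?');
        $sth->execute(42);
        while ( my ($id, $name) = $sth->fetchrow_array ) {
            print "$id: $name\n";
        }
        $dbh->disconnect;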

    There are useful discussions of DBI and DBM around, if you want to look into that further.

Re: When is a flat file DB not enough?
by petdance (Parson) on Jul 03, 2001 at 21:26 UTC
    There is no universal number N at which performance becomes unacceptable. It's a number you'll have to derive yourself, based on acceptability criteria that you have to work out for your own situation.

    It's misguided to try to find a magic number. What you'll have is a curve of performance, probably something like this:

        Records   => response time
        100       => effectively instant
        1,000     => 1 sec
        10,000    => 3 sec
        100,000   => 10 sec
        1,000,000 => 5 minutes
    Once you derive that curve, you can say "OK, if we want to keep performance around a 10-second response time, we can't have more than 100,000 records, and if we want more than that, we're going to have to go to a bigger/different system."

    The other part of it is to define very carefully what that performance acceptability is. Is it OK if the user has to wait 1 second to do an update? 10 seconds? A minute? Define those, in writing, and make sure that you get that approved by anyone who could squawk. There's nothing worse than having your requirements changed out from under you, when your boss says "Hey, it's taking too long to update", because you thought that a 5-second lag was acceptable.

    xoxo,
    Andy
    --
    Throw down the gun and tiara and come out of the float!

      That's the kind of situation I'll need to explain to the bosses. What will keep the number of records down, though, is the users' tendency to whine if things aren't as close to instant as possible, but also my own insecurities about the database. While I've made it to the best of my abilities, I still fear that someone will update it and it'll explode and lose all data - you know, something silly.
      Still, thanks for that mini-table of numbers. Managers love pictures =)
Re: When is a flat file DB not enough?
by bikeNomad (Priest) on Jul 03, 2001 at 19:52 UTC
    BerkeleyDB has the Recno type, which can handle a flat file. The nice thing about getting BerkeleyDB to do it is that it handles the caching, locking, etc.
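    As a sketch of what that might look like with the DB_File module that ships with Perl (the file name is invented; the heavier BerkeleyDB module adds locking and transactions on top of this):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie an array to a Recno (record-number) database: each element is
        # one record, and Berkeley DB looks after paging and caching for you.
        tie my @items, 'DB_File', 'items.recno', O_RDWR | O_CREAT, 0666, $DB_RECNO
            or die "Cannot open items.recno: $!";

        push @items, 'first record';
        push @items, 'second record';
        print "record 0 is: $items[0]\n";
        print scalar(@items), " records total\n";

        untie @items;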
Re: When is a flat file DB not enough?
by toma (Vicar) on Jul 03, 2001 at 21:18 UTC
    32-bit operating systems typically have a file size limit of 2 gigabytes. Even if you have a 64-bit OS, many tools such as tar and ftp have trouble with files larger than 2GB, so don't exceed that.

    I also like DBD::CSV, and if you need to switch from a flat file to a database, DBD::CSV makes this easy.

    You didn't mention whether you will need to load the whole file into memory at once. If you do need to load the whole file often, and it gets larger than about 50MB, it might be a good idea to use the Storable module. Storable stores and loads a Perl data structure. It is amazingly fast and easy to use.

    The downside of Storable is that, by default, it creates a file that is machine-architecture dependent (that is, not particularly portable). Also, it uses a binary file format, so you won't be able to browse it with a text editor.
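    As a rough sketch (the data structure here is invented; yours would be whatever your scripts already build):

        use strict;
        use warnings;
        use Storable qw(store retrieve);

        # Hypothetical structure: items keyed by ID, each with its comments.
        my %items = (
            1 => { name => 'widget', comments => ['initial entry'] },
        );

        store \%items, 'items.stor';           # dump the whole structure in one call
        my $loaded = retrieve('items.stor');   # ...and slurp it back just as fast

        print $loaded->{1}{name}, "\n";        # prints "widget"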

    It should work perfectly the first time! - toma

      Storable can be told to produce a portable structure by using the n* versions of its routines (nstore, nfreeze), which write in network byte order. So if you use nfreeze to serialize, the ordinary thaw will deserialize it on any architecture (likewise nstore/retrieve for files), and it should be portable.
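      Roughly (reusing the same invented structure as the sketch above):

          use strict;
          use warnings;
          use Storable qw(nstore retrieve nfreeze thaw);

          my %items = ( 1 => { name => 'widget' } );

          nstore \%items, 'items.stor';        # network byte order on disk...
          my $back = retrieve('items.stor');   # ...readable on any architecture

          my $frozen = nfreeze(\%items);       # or serialize to an in-memory string
          my $copy   = thaw($frozen);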
Re: When is a flat file DB not enough?
by voyager (Friar) on Jul 03, 2001 at 20:58 UTC
    If you are doing individual DML commands (insert, update, delete), you'll be fine for a very long time.

    Where the flat file begins to have problems is when you have 100 inserts to do and the flat-file interface (e.g., DBD::CSV) wants to read and write the whole file for each command.

    Why not test for your specific case? Create a large dataset and see how your app performs.

Re: When is a flat file DB not enough?
by sierrathedog04 (Hermit) on Jul 03, 2001 at 21:02 UTC
    By default the Apache webserver uses flat .htpasswd files to store user IDs and encrypted passwords. One can upgrade from a flat file to a real RDBMS if performance requires it.

    And I have read that performance may require it once there are more than a few hundred users. So, a rough estimate is that your flat file should not exceed a few hundred lines.

    However, YMMV.

Re: When is a flat file DB not enough?
by mattr (Curate) on Jul 04, 2001 at 18:08 UTC
    There is also the issue of when you need database-like functions, such as a query language, locking, or management of a lot of data with presorted indices. The Berkeley DB site at Sleepycat has an interesting "What is/is not Berkeley DB?" section. The "is not" page may be useful, because while BDB is a hugely powerful system that can handle terabytes of data, it is not SQL or relational, though you could build a database engine on top of it. This is a pretty interesting page, and it might make you start thinking about extending the life of your system.

    Also, I would expect the <2GB limitation for your plain old flat file on machines that only use 32-bit addressing, but this may not apply to BDB. Their page says:

        Because both keys and values can be up to four gigabytes in length, a single record can store images, audio streams, or other large data values. Large values are not treated specially in Berkeley DB. They are simply broken into page-sized chunks, and reassembled on demand when the application needs them.
    Now, looking at Database Limits, it looks like 2GB per database file would be correct for BDB on a machine with a 32-bit filesystem.

    Actually, now I've started wanting to read it from the beginning, and maybe you will too. Hint: the first page mentions flat files and structured data!

Re: When is a flat file DB not enough?
by Nitsuj (Hermit) on Jul 05, 2001 at 05:35 UTC
    Berkeley DB has been mentioned a few times here. I think that this is probably the best idea in your situation (though I would strongly encourage migrating to a good, sturdy DBMS if you are sorting that much data). For just a couple hundred records, you will probably not notice much difference, but if this sucker is getting big... well...

    Of course, you should use a database hash, and that will sort things out nicely.

    Another thought is that you can breathe a LOT of life into flat files with a few simple methods.
    One simple one is to use your filesystem to provide some of the services of a DBMS (e.g., directories named after user numbers), but I don't really recommend that.
    Another idea is to put some forethought into your file format. You can put tables at the front to index the data, run simple hashes over files to speed up searching, use tree structures, and so on. An important fact is that you don't actually need to visit every record in order to search a file: if the data is ordered, you can search positionally (kind of like the number game where the computer says "higher" or "lower"). That cuts search time from linear to logarithmic; a B-tree search takes about log2(n) visits, and with judicious use of file pointers you can apply the same trick in any number of ways. The simple fact of the matter, though, is that implementing such a system by hand is rather nerve-wracking (there is probably a Perl module that does something to this effect). A rough sketch of the positional search follows.
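    This sketch assumes fixed-width, newline-terminated records already sorted on a key in the first pipe-delimited field; the file name, record length, and layout are all invented for illustration:

        use strict;
        use warnings;

        my $file    = 'items.sorted';   # hypothetical pre-sorted copy of the data
        my $rec_len = 64;               # fixed bytes per record, newline included

        sub find_record {
            my ($key) = @_;
            open my $fh, '<', $file or die "Can't read $file: $!";
            my $count = int( (-s $fh) / $rec_len );
            my ($lo, $hi) = (0, $count - 1);
            while ($lo <= $hi) {
                my $mid = int( ($lo + $hi) / 2 );
                seek $fh, $mid * $rec_len, 0;       # jump straight to record $mid
                read $fh, my $rec, $rec_len;
                # keys must compare correctly as strings (e.g. zero-padded)
                my ($rec_key) = split /\|/, $rec;   # key is the first field
                if    ($rec_key lt $key) { $lo = $mid + 1 }
                elsif ($rec_key gt $key) { $hi = $mid - 1 }
                else                     { return $rec }
            }
            return;                                 # not found
        }

        print find_record('item_05000') || "no such item\n";

    (Or let a module such as DB_File's BTREE mode do the same job for you, as mentioned in the other replies.)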

    I personally would try splitting your file across some criteria into smaller files, which can be searched flatly, and hand it off to IT for a proper DBMS if access time gets to be a problem.

    Just Another Perl Backpacker