Re: scripts stops running .. almost
by mzedeler (Pilgrim) on Jul 17, 2009 at 10:14 UTC
|
If you populate a B-tree style file, the performance on inserts quickly degrades with the number of records already inserted. Try measuring the insert performance as a function of the number of records in the file. Using the hash variant that Berkeley DB supports should give you much better performance.
Another option is to try profiling your script. That should provide an indication of where the bottleneck is.
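A minimal sketch of that measurement, assuming the script uses DB_File. The helper name time_inserts, the key format, and the batch sizes are made up for illustration. It times inserts in fixed-size batches, once for B-tree and once for hash storage, so any slowdown shows up as the file grows:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;
use File::Temp qw(tempdir);
use Time::HiRes qw(time);

# Insert $total records into a DB_File of the given type and return
# [records_so_far, seconds_for_last_batch] for every $batch inserts.
# (Hypothetical helper; the key and value layout is an assumption.)
sub time_inserts {
    my ($type, $path, $total, $batch) = @_;
    my %h;
    my $db = tie %h, 'DB_File', $path, O_RDWR|O_CREAT, 0666, $type
        or die "Cannot open $path: $!";
    my @timings;
    my $start = time;
    for my $i (1 .. $total) {
        $h{ sprintf 'key%010d', $i } = "value $i";   # sorted keys
        if ($i % $batch == 0) {
            push @timings, [ $i, time - $start ];
            $start = time;
        }
    }
    undef $db;
    untie %h;
    return @timings;
}

my $dir = tempdir(CLEANUP => 1);
for my $case ([ btree => $DB_BTREE ], [ hash => $DB_HASH ]) {
    my ($name, $type) = @$case;
    for my $t (time_inserts($type, "$dir/$name.db", 20_000, 5_000)) {
        printf "%-5s %7d records: %.3fs for last batch\n", $name, @$t;
    }
}
```

If the per-batch time climbs steadily for one storage type and stays flat for the other, that tells you where to look. The record counts here are far too small to reproduce a multi-hour slowdown, so scale them up toward your real data.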
| [reply] |
Re: scripts stops running .. almost
by targetsmart (Curate) on Jul 17, 2009 at 10:11 UTC
|
IMHO your program is eating all the available memory without releasing or reusing it...
What percentage of memory does the program use after 2-3 hours of running?
Vivek
-- 'I' am not the body, 'I' am the 'soul', which has no beginning or no end, no attachment or no aversion, nothing to attain or lose.
| [reply] |
Re: scripts stops running .. almost
by JavaFan (Canon) on Jul 17, 2009 at 11:48 UTC
|
Give it some coffee! I slow down too after 2-3 hours of working at full speed without coffee. | [reply] |
Re: scripts stops running .. almost
by tweetiepooh (Hermit) on Jul 17, 2009 at 14:46 UTC
|
Some databases don't write the data in properly without an explicit commit, filling up temporary structures. This is to ensure read consistency.
Maybe there is some form of commit-like statement that will flush this database buffer, remove locks, relock and start again with the next block of data. | [reply] |
|
I've heard about this problem happening. I think tweetiepooh could be on to something important here. I was reading more about Berkeley DB here:
http://www.oracle.com/technology/documentation/berkeley-db/db/gsg_txn/C/index.html
There is a lot of bookkeeping needed to keep track of a transaction. If, say, 2 hours of inserts are one transaction which in theory could be aborted with no change to the DB, there's a lot of overhead there! A commit would say, "I'm finished with this one". I am not a DB guru, but I'm also wondering if there are options that circumvent some of the normal transaction rollback and journaling for the case of a single user doing the initial DB create from scratch. I don't know. Just wondering if this initial build is somehow handled differently than the "online use" of the thing once built?
Update: I would leave output unbuffered until you get this working. But you should be aware that there is a significant performance penalty for that. In this case, we could be talking hours of difference! Get it working, then turn buffering back on and see what happens. Right now I suspect that tweetiepooh's idea of committing every 100 (or however many) adds is going to do something impressive.
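A rough sketch of the "flush every N adds" idea, assuming the script uses DB_File. I'm not certain what the right commit call is for that interface, so this uses the sync method on the tied object as the nearest thing to a commit. The helper name load_with_sync and the interval of 100 are made up for illustration:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;
use File::Temp qw(tempdir);

# Store the given [key, value] pairs and flush to disk every $every inserts.
# Returns the number of flushes performed. (Hypothetical helper.)
sub load_with_sync {
    my ($path, $every, @pairs) = @_;
    my %h;
    my $db = tie %h, 'DB_File', $path, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot open $path: $!";
    my ($count, $syncs) = (0, 0);
    for my $pair (@pairs) {
        $h{ $pair->[0] } = $pair->[1];
        if (++$count % $every == 0) {
            # sync returns 0 on success
            $db->sync == 0 or die "sync failed: $!";
            $syncs++;
        }
    }
    undef $db;
    untie %h;
    return $syncs;
}

my $dir   = tempdir(CLEANUP => 1);
my @pairs = map { [ "key$_", "value $_" ] } 1 .. 1_000;
my $syncs = load_with_sync("$dir/load.db", 100, @pairs);
print "flushed $syncs times\n";
```

Whether this helps depends on where the time is really going, so measure with and without the periodic flush before keeping it.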
| [reply] |
|
The original Berkeley DB didn't support transactions. Even though the newer versions do support transactions, it doesn't seem that DB_File supports them.
If it works as I used to know it, every change is written straight to the file as it is done, but you can't use the file size as a safe measurement of every write.
A different way to speed up the load is randomizing the order of the keys (or using a pseudo-random mapping of the keys themselves, such as MD5). I know it sounds odd, but if you are using B-tree storage and the keys are sorted, you get very long load times because the tree is constantly being rebalanced.
My suggestion with regard to trying hash storage still stands. Try that first.
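A small sketch of the key-scrambling idea. It orders the keys by their MD5 digest, which gives a pseudo-random but repeatable insertion order while leaving the keys themselves unchanged. The helper name md5_order and the sample keys are made up for illustration:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Return the given keys ordered by their MD5 hex digest.
# (Hypothetical helper; the same input always gives the same order.)
sub md5_order {
    my @keys = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ md5_hex($_), $_ ] } @keys;
}

my @sorted_keys = map { sprintf 'key%05d', $_ } 1 .. 10;
print "$_\n" for md5_order(@sorted_keys);
```

Insert the records in the order md5_order returns instead of in sorted order. If you don't need a repeatable order, shuffle from List::Util does the same job with less code.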
| [reply] |