PerlMonks
script stops running .. almost

by radu_marg (Initiate)
on Jul 17, 2009 at 10:03 UTC [id://780996]

radu_marg has asked for the wisdom of the Perl Monks concerning the following question:

I am using a script, run from a Linux shell, which reads information from a text file and saves the data into an embedded database. The text file contains 100 million entries, so the script needs over 10 hours to complete. The problem I have is that after running for 2-3 hours, the script almost stops: the process uses only 1% of the CPU and its progress virtually halts. I tried two different databases, Berkeley DB and Tokyo Cabinet, and the problem occurs with both. I do not think this is a buffering problem, since I made sure the standard output is unbuffered. What am I missing? Thank you, Radu
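
(No code was posted, so purely for concreteness, here is a minimal sketch of the kind of loader described above; the file names, the tab-separated record format, and the use of DB_File are all assumptions:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    $| = 1;    # unbuffered STDOUT, as described in the question

    # Tie a hash to an on-disk Berkeley DB file (B-tree variant).
    tie my %db, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot open data.db: $!";

    open my $in, '<', 'entries.txt' or die "Cannot open entries.txt: $!";
    my $count = 0;
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;   # assumed record format
        $db{$key} = $value;
        print "$count records inserted\n" if ++$count % 1_000_000 == 0;
    }
    close $in;
    untie %db;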

Replies are listed 'Best First'.
Re: script stops running .. almost
by mzedeler (Pilgrim) on Jul 17, 2009 at 10:14 UTC

    If you populate a B-tree style file, insert performance quickly degrades with the number of records already inserted. Try measuring the insert performance as a function of the number of records in the file. Using the hash variant that Berkeley DB supports should give you much better performance (see the sketch below).

    Another option is to try profiling your script. That should provide an indication of where the bottleneck is.
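
    With DB_File's tie interface, switching between the two variants is a one-line change (a sketch; the file name is an assumption):

        use DB_File;
        use Fcntl;

        # B-tree variant: keeps keys sorted, rebalances as the tree grows.
        tie my %btree, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0666, $DB_BTREE;

        # Hash variant: no key ordering, but flatter insert cost at scale.
        tie my %hash, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0666, $DB_HASH;

    For the profiling, Devel::NYTProf is one option: run perl -d:NYTProf yourscript.pl, then nytprofhtml to turn the output into a report.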

Re: script stops running .. almost
by targetsmart (Curate) on Jul 17, 2009 at 10:11 UTC
    IMHO your program is eating all the available memory without releasing or reusing it.
    What percentage of memory does the program use after 2-3 hours of running?
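
    One way to watch this from inside the script on Linux is to read /proc (a sketch; Linux-only, and how often you call it is up to you):

        # Print this process's virtual and resident memory usage.
        sub report_mem {
            open my $fh, '<', "/proc/$$/status" or return;
            while (<$fh>) { print if /^Vm(?:Size|RSS):/ }
            close $fh;
        }

        # e.g. call report_mem() every million inserts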

    Vivek
    -- 'I' am not the body, 'I' am the 'soul', which has no beginning or no end, no attachment or no aversion, nothing to attain or lose.
Re: script stops running .. almost
by JavaFan (Canon) on Jul 17, 2009 at 11:48 UTC
    Give it some coffee! I slow down after 2-3 hours of working at full speed without coffee, too.
Re: script stops running .. almost
by tweetiepooh (Hermit) on Jul 17, 2009 at 14:46 UTC
    Some databases don't actually write the data out without an explicit commit; instead they fill up temporary structures. This is to ensure read consistency.

    Maybe there is some form of commit-like statement that will flush this database buffer, release the locks, re-lock, and start again with the next block of data.

      I've heard about this problem happening, and I think tweetiepooh could be on to something important here. I was reading more about Berkeley DB here:
      http://www.oracle.com/technology/documentation/berkeley-db/db/gsg_txn/C/index.html
      There is a lot of bookkeeping involved in tracking a transaction, and if you are in a situation where, say, 2 hours of inserts form one transaction which in theory could be aborted with no change to the DB, that's a lot of overhead! A commit says, "I'm finished with this one." I am not a DB guru, but I'm also wondering whether there are options that circumvent some of the normal transaction rollback and journaling for the case of a single user doing the initial DB create from scratch. I don't know; I'm just wondering if this initial build is somehow handled differently from the "online use" of the thing once it is built.

      Update: I would leave output unbuffered until you get this working, but be aware that there is a significant performance penalty for that; in this case, we could be talking hours of difference! Get it working, then turn buffering back on and see what happens. Right now I suspect that tweetiepooh's idea of committing every 100 or so inserts is going to do something impressive; a sketch of that idea follows.
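
      A sketch of the commit-every-N idea using the BerkeleyDB module's transaction API (note: BerkeleyDB, not DB_File; the batch size, file names, and tab-separated record format are all guesses on my part):

          use strict;
          use warnings;
          use BerkeleyDB;

          # Transactional environment; the 'dbhome' directory must exist.
          my $env = BerkeleyDB::Env->new(
              -Home  => 'dbhome',
              -Flags => DB_CREATE | DB_INIT_MPOOL | DB_INIT_TXN
                      | DB_INIT_LOCK | DB_INIT_LOG,
          ) or die "Env: $BerkeleyDB::Error";

          my $db = BerkeleyDB::Hash->new(
              -Filename => 'data.db',
              -Flags    => DB_CREATE,
              -Env      => $env,
          ) or die "DB: $BerkeleyDB::Error";

          my $txn = $env->txn_begin();
          $db->Txn($txn);

          my $count = 0;
          while (my $line = <STDIN>) {
              chomp $line;
              my ($key, $value) = split /\t/, $line, 2;
              next unless defined $value;       # skip malformed lines
              $db->db_put($key, $value);
              if (++$count % 10_000 == 0) {     # commit in batches
                  $txn->txn_commit();
                  $txn = $env->txn_begin();     # start the next batch
                  $db->Txn($txn);
              }
          }
          $txn->txn_commit();                   # commit the final partial batch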

      The original Berkeley DB didn't support transactions. Even though the newer versions do support transactions, it doesn't seem that DB_File supports them.

      If it works the way I remember, every change is written straight to the file as it is made, but you can't use the file size as a reliable indication that every write has happened.

      A different way to speed up the load is to randomize the order of the keys (or to apply a pseudo-random map to the keys themselves, such as MD5; see the sketch below). I know it sounds odd, but if you are using B-tree storage and the keys arrive in sorted order, you get very long load times because the tree is constantly being rebalanced.

      My suggestion with regard to trying hash storage still stands. Try that first.
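
      A sketch of that key-randomizing trick (illustrative; %db is assumed to be the tied B-tree hash, and lookups must hash the key the same way):

          use Digest::MD5 qw(md5_hex);

          # Mapping each key through MD5 turns sorted input into an
          # effectively random insert order, so inserts stop piling
          # into the same region of the B-tree.
          sub put_randomized {
              my ($db, $key, $value) = @_;
              $db->{ md5_hex($key) } = $value;
          }

          # usage:  put_randomized(\%db, $key, $value);
          # lookup: $db{ md5_hex($key) }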
