Re^2: Tracking down memory leaks

I can't imagine code like that working in this application since the files can be quite large. Here is an outline of how the script works:

Prepare several DB SELECT statement handles that will be used inside the loop to get useful information.
Create tied hashes for caching that information so that I don't have to hit the database every time I need the id of some frequently used term.
Create an IO object that will parse the file line by line and hand back information about the line in an OO way.
Loop using the IO object's next_feature method. Do lots of bookkeeping using the tied hashes. Write output to several (about 10) files that will later be loaded into postgres using COPY FROM STDIN.
Close open files, destroy DB statement handles, and load data into database.
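
The caching step in the outline above can be sketched roughly as follows. This is a minimal illustration and not the actual script: `cached_id` is a hypothetical helper name, and the `$fetch` callback is an assumed stand-in for executing one of the prepared SELECT handles. The cache is passed in as a hash reference, so the same code works whether the hash is plain or tied.

```perl
use strict;
use warnings;

# Hypothetical helper: look up the id for $term, consulting the cache
# first and calling $fetch only on a miss. $cache may be a reference to
# a plain hash or to a tied hash.
sub cached_id {
    my ($cache, $fetch, $term) = @_;
    return $cache->{$term} if exists $cache->{$term};
    my $id = $fetch->($term);              # stand-in for a database query
    $cache->{$term} = $id if defined $id;  # remember only successful lookups
    return $id;
}

# Demonstration with a fake "database" and a counter of how often it is hit.
my %fake_db = (gene => 1, exon => 2);
my $db_hits = 0;
my $fetch   = sub { $db_hits++; return $fake_db{ $_[0] } };

my %cache;
my @ids = map { cached_id(\%cache, $fetch, $_) } qw(gene exon gene gene);
print "ids: @ids, database hits: $db_hits\n";
```

Repeated terms are answered from the cache, so the counter only rises on the first sighting of each term.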

Scott
Project coordinator of the Generic Model Organism Database Project

Re^3: Tracking down memory leaks by dragonchild (Archbishop) on Apr 13, 2005 at 13:46 UTC
"Create tied hashes for caching that information so that I don't have to hit the database every time I need the id of some frequently used term."

And you're wondering why your memory usage is increasing? Why don't you try a run with caching disabled and see if that fixes your problem?
Re^4: Tracking down memory leaks by scain (Curate) on Apr 13, 2005 at 13:56 UTC
That's why I tied them to DB_File: I can watch those files grow as the program runs, but there are only 4 of them and they grow to only about 30 MB each.
Re^3: Tracking down memory leaks by Anonymous Monk on Apr 13, 2005 at 14:48 UTC
"Create tied hashes for caching that information so that I don't have to hit the database every time I need the id of some frequently used term."

Did you benchmark this? Repeatedly asking the database for the same thing might not be so bad if your database is good at caching, and tied hashes in Perl are slow. There are many factors involved, and what's best will vary from setup to setup, but don't be too quick to dismiss the alternatives in favor of tied hashes if it's performance you care about. Of course, this has nothing to do with your memory problem.
Re^4: Tracking down memory leaks by scain (Curate) on Apr 13, 2005 at 15:38 UTC
While it isn't related to the memory problem, we did think about this. The thing is, each time through the loop we have to hit the database about 10 times to obtain ids for a relatively small number of possible items. We are trying to eliminate the fixed overhead of hitting the database at all, not the time spent waiting for the query itself, which is very fast. It is that per-call overhead that takes a while (comparatively), and compared to a lookup in a small BerkeleyDB database it should favor BerkeleyDB. What I will probably do after I get these memory issues out of the way is offer a command-line option to use either in-memory hashes or tied hashes, depending on the size of the file.
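
A switch like that might look something like the sketch below. It is only an illustration under stated assumptions: the `--disk-cache` option name, the `make_cache` helper, and the `term_ids.db` filename are all hypothetical and not taken from the real script. The rest of the code only ever sees a hash reference, so it does not need to know which kind of cache it was given.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use DB_File;
use Getopt::Long;

# Hypothetical helper: return a reference to either a plain in-memory
# hash or a hash tied to a DB_File file at $path.
sub make_cache {
    my ($on_disk, $path) = @_;
    my %cache;
    if ($on_disk) {
        tie %cache, 'DB_File', $path, O_RDWR | O_CREAT, 0644, $DB_HASH
            or die "Cannot tie $path: $!";
    }
    return \%cache;
}

# Assumed option name: --disk-cache selects the tied hash,
# the default is the in-memory hash.
my $on_disk = 0;
GetOptions('disk-cache' => \$on_disk)
    or die "usage: $0 [--disk-cache]\n";

my $cache = make_cache($on_disk, 'term_ids.db');  # hypothetical filename
$cache->{gene} = 1;
print "gene => $cache->{gene} (", ($on_disk ? 'tied' : 'in-memory'), " cache)\n";
```

Small input files would run with the default in-memory hash, and large ones with `--disk-cache` to keep the cache on disk.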

