Threads in Perl: Just leaky?

by TheShrike (Novice)
on Oct 04, 2007 at 18:37 UTC

TheShrike has asked for the wisdom of the Perl Monks concerning the following question:

So, I'm modifying a Perl script I wrote to use multi-threading. It takes about 30 minutes to run using 1-2% of the CPU on our machines, and with 10 threads doing the work at the same time it uses 10-20% of the CPU... but it runs out of memory.

What the script did originally is read 20,000+ files, parse them for certain bits of information, and put that into a hash of arrays of hashes of arrays of hashes so that XML::Simple could output the relevant information into a neat 7.5MB XML file (later loaded into another script).

Now, I am basically trying to make a thread for the parsing of each file, with a limit of 10 threads. I am running Perl 5.8.2, using threads and threads::shared. Here is a basic example of what I'm doing:

use threads;
use threads::shared;

my @threads;
my $hashone  = &share({});   # &share bypasses the prototype so it can share an anonymous hash
my $arrayone = &share([]);
$hashone->{"arraykey"} = $arrayone;

for my $filename (@filenames) {   # @filenames holds the input file names
    # some other stuff
    my $hashtwo = &share({});
    push @$arrayone, $hashtwo;    # push needs the dereferenced array
    my $arraytwo = &share([]);
    $hashtwo->{$filename} = $arraytwo;
    # add some other values to $hashtwo

    if ($#threads >= 9) {         # cap the pool at 10 threads
        my $thread = shift @threads;
        $thread->join();
        undef $thread;
    }
    push @threads, threads->create(\&threadfunction, $filename, $arraytwo);
}
# clean up all the other threads, finish as I would without threading

sub threadfunction {
    my ($filename, $arraytwo) = @_;
    open(PIPE, "outputprogram $filename |") or die "can't run outputprogram: $!";
    while (<PIPE>) {
        # add various $hashthree = &share({}) to @$arraytwo
        # add various values to the $hashthree's
    }
    close PIPE;
}

The undef $thread; I added because of another post I found, whose author said his threads weren't giving up their memory otherwise, which didn't make any sense to him (nor to me). It did allow the program to run longer before running out of memory.

I tried sharing the pointers to the hashes/arrays outside of the thread. That did not help.

My only ideas were:
1. The array/hash/array etc. is taking up too much memory, but it wasn't before I added threads, so this makes no sense.
2. The pipes out of those files are taking up too much memory now that there's 10 of them. But each one is only 10kB of information so that's impossible.
3. Somehow, maybe due to the references not being shared, it was cloning the hash/array/hash etc. for each thread, which would be big to begin with and get bigger for each new thread as time went on. Except, if it were cloning them, the original would not be getting much bigger at all. All the pointers are passed in by value anyway, and I have to assume all the values in a shared hash/array are shared (in fact, by the rules, I don't think I could add a non-shared anything to a shared hash/array). The only shared values I create in a thread are the arrays/hashes, and I point to those only from the main shared arrays/hashes in the parent thread.

I don't know if this problem is solvable, or if it's something inherently wrong with Perl threading and what I'm trying to do with it, but it would be nice to at least know why it's happening. Any ideas? Should I just leave it as a 30-minute process and code it in Java instead if I actually want it to work?

Re: Threads in Perl: Just leaky?
by BrowserUk (Patriarch) on Oct 04, 2007 at 21:07 UTC
    I am running Perl 5.8.2

    Upgrade! To at least 5.8.6, preferably 5.8.8.

    There is absolutely no point in trying to work around all the bugs that existed with threading in 5.8.2.

    Once you've upgraded, if your problem still exists you will need to post more information, e.g. what OS you're running on.

    You'll also need to post real (preferably cut-down), runnable code, rather than pseudo-code. Barring something very obvious, which probably won't show up in pseudo-code anyway, it's pretty much impossible to track down memory leaks without being able to run the code.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Threads in Perl: Just leaky?
by zentara (Archbishop) on Oct 04, 2007 at 19:33 UTC
    I haven't been able to see an obvious leak in your pseudocode, BUT I have seen this thread-leak thing quite a few times myself. If I were to hazard a guess, I would point to the fact that a thread (at creation) gets a complete copy of the parent. If the parent is retaining references from previous threads, each new thread grows larger and larger as it carries unwanted garbage forward.

    Now there may be some way to re-structure your code, using detach or something, which may release the memory. However, this is all "hit-and-miss": you may spend days trying and never find the right combination. BrowserUk is very good at this type of thread-queue stuff, and maybe he will spot a way.

    I tend to rely on clunkier but sure-fire methods, so I don't have to think too hard. :-) If you want a sure-fire way to prevent memory gains, set up reusable threads. If your limit is 10 threads, then at program start create 10 sleeping threads and feed them data to process through shared variables. When a thread is done processing, return it to sleep and push its thread object into a shared array called "@ready". Now you just monitor @ready and shift off an available thread instead of creating a new one. I have an example using Tk that juggles 3 threads; you can up it to 10. It's a contrived example, so expect some unneeded junk in the code.
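    A minimal non-Tk sketch of that idea -- 3 sleeping workers fed through shared variables, with finished workers pushing their thread IDs back onto @ready (thread objects themselves can't live in a shared array, so the sketch stores tids; worker(), process_file() and the 'STOP' token are placeholder names, not zentara's Tk code):

        use strict;
        use warnings;
        use threads;
        use threads::shared;

        my @ready : shared;    # tids of workers waiting for their next file
        my %work  : shared;    # per-tid slot holding the next filename (or 'STOP')
        my @files = @ARGV;     # stand-in for the 20000+ input files

        my @pool;
        for (1 .. 3) {                       # up this to 10
            my $thr = threads->create(\&worker);
            push @pool, $thr;
            { lock @ready; push @ready, $thr->tid; }
        }

        for my $file (@files) {
            sleep 1 until @ready;            # clunky but sure-fire: poll for a free worker
            my $tid;
            { lock @ready; $tid = shift @ready; }
            { lock %work;  $work{$tid} = $file; }
        }

        sleep 1 until @ready == @pool;       # wait for the last workers to finish
        { lock %work; $work{$_->tid} = 'STOP' for @pool; }
        $_->join for @pool;

        sub worker {
            my $tid = threads->tid;
            while (1) {
                my $file;
                { lock %work; $file = delete $work{$tid}; }
                if (!defined $file) { sleep 1; next }    # nothing yet: back to sleep
                last if $file eq 'STOP';
                process_file($file);                     # the real parsing goes here
                { lock @ready; push @ready, $tid; }      # mark myself available again
            }
        }

        sub process_file { print "processing $_[0]\n" }  # stub for the real parser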


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Threads in Perl: Just leaky?
by Joost (Canon) on Oct 04, 2007 at 20:15 UTC
    My (limited) experience with perl threads suggests that creating threads is quite an expensive operation, and that you also have to be careful about the number of objects that get copied from the parent to the new thread.

    What I did to get around this issue is to use a very limited main thread (basically, only load the modules you want to use, if that), then start all the threads you're going to need. (Depending on usage, I use 1 worker thread for each CPU, but if your process is disk-bound you may want more.)

    Then possibly load more stuff into the main thread and do all communication between the main and worker threads using Thread::Queue, or a few other shared objects.

    Trying to share a load of nested objects will probably result in a lot of overhead.

    In any case: try to limit the number of threads you create and join - re-use as many as possible (i.e. all of them). That way, you can get quite stable memory usage.
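    A sketch of that shape (parse_file() here is a stand-in for your real parsing code): create the pool first, then feed filenames down a Thread::Queue, with one undef per worker as an end-of-work marker:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $q = Thread::Queue->new;

        # Create the pool before any heavy setup, so each clone stays small.
        my @pool = map { threads->create(\&worker) } 1 .. 10;

        # Now do the expensive loading in the main thread and feed the queue.
        $q->enqueue($_) for glob '*.dat';    # stand-in for the 20000+ files
        $q->enqueue(undef) for @pool;        # one end-of-work marker per worker
        $_->join for @pool;

        sub worker {
            while (defined(my $file = $q->dequeue)) {
                parse_file($file);  # send results back on a second queue if needed
            }
        }

        sub parse_file { print "parsed $_[0]\n" }   # stub for the real parser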

    Update: you may also want to get the latest stable perl (5.8.8), but at the very least you'll want the latest versions of threads and threads::shared from CPAN.

Re: Threads in Perl: Just leaky?
by renodino (Curate) on Oct 04, 2007 at 21:19 UTC
    In addition to BrowserUk's suggestion, be sure to install the latest versions of threads and threads::shared, which are now dual lifed, and have had numerous fixes applied in the past 12-18 months.
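    For example, from the command line (one of several ways; the interactive CPAN shell works too):

        perl -MCPAN -e 'install("threads"); install("threads::shared")'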

    Perl Contrarian & SQL fanboy
      'Dual lifed'??
      Can I have subtitles (English) for that?
      BTW, I agree with using re-usable threads instead of creating/destroying them each time. It solved a memory-leak problem for me, although I don't work there any more, so I can't confirm version numbers.
      Suffice to say it was a program that had to run 24/7, so even a small leak eventually killed the machine. It has been running fine for ages now.

      Cheers
      Chris

        Dual-lifed means the module is available both on CPAN and as part of a Perl release. Most modules available as part of the Perl distribution start out as CPAN modules and are then "added to core" at a later point. However some modules see their initial release as part of the core distribution, and occasionally are then split out. This process is often called "becoming dual-lifed".

        ---
        $world=~s/war/peace/g

Re: Threads in Perl: Just leaky?
by talexb (Chancellor) on Oct 05, 2007 at 14:32 UTC

      So, I'm modifying a Perl script I wrote to use multi-threading. It takes about 30 minutes to run using 1-2% of the CPU on our machines, and with 10 threads doing the work at the same time it uses 10-20% of the CPU... but it runs out of memory.

      What the script did originally is read 20,000+ files, parse them for certain bits of information, and put that into a hash of arrays of hashes of arrays of hashes so that XML::Simple could output the relevant information into a neat 7.5MB XML file (later loaded into another script).

    Without even reading further than this, the answer that pops into my head is 'Use a database!'

    It's all very well to stash stuff into a hash, but (as you've discovered) this doesn't scale well. That's one of the things that databases are great for -- they take care of filing that stuff away for you, then giving it back later.

    If you're leery of setting up a database, I can highly recommend SQLite as a solution. Tiny and powerful, there's nothing to configure. Just point it to a database and start adding stuff. It just works, and performance is fantastic.
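    A minimal sketch, assuming a results table of (file, key, value) rows and a parse_file() of your own -- both made up for illustration:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=results.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS results (file TEXT, key TEXT, value TEXT)');

        my $ins = $dbh->prepare('INSERT INTO results (file, key, value) VALUES (?, ?, ?)');
        for my $file (glob '*.dat') {
            for my $item (parse_file($file)) {   # your existing parser goes here
                $ins->execute($file, $item->{key}, $item->{value});
            }
        }
        $dbh->commit;   # one commit for the whole batch keeps the inserts fast

        sub parse_file { return { key => 'size', value => -s $_[0] } }  # stub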

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Threads in Perl: Just leaky?
by NiJo (Friar) on Oct 05, 2007 at 18:35 UTC
    I'll assume your general goal is to reduce wall clock time.

    With a huge number of small files, you are most probably bound by disk seeks. Threading helps to do something with the CPU during these seeks, but does not attack the root cause. I've not seen an OS interface for optimized reading from many files.

    In your single-threaded application I'd think about defragmenting the file reads: in a preliminary phase you can stat() all input files, sort the list by inode, and then read them one by one. You won't hear disk head movements anymore.

    'Threading for the lazy' would not require changing perl versions. It involves splitting your application into three processes connected by OS pipes. The first process ('stater') does the stat task and pipes the sorted file names to the dumper via STDOUT. Include a large output buffer to separate the stat seeks from the read seeks. The second process ('dumper') is designed to be waiting on I/O most of the time; after dumping a file's contents it sends some 'EOF' token to the mostly unchanged interpreter process. Your envelope could be a shell script 'stater | dumper | interpreter' or a pipe open() variation in perl.
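    A sketch of the 'stater' stage -- stat every file, sort by inode with a Schwartzian transform ((stat)[1] is the inode number), and print the list for the dumper to read:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @files = @ARGV;   # or however you collect the 20000+ names
        print "$_\n" for map  { $_->[1] }
                         sort { $a->[0] <=> $b->[0] }
                         map  { [ (stat $_)[1], $_ ] }   # pair each name with its inode
                         @files;

    Then 'stater.pl *.log | dumper | interpreter' gives you the envelope.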

    On Linux, the CFQ I/O scheduler and 'ionice -c 3 <program>' let other processes take priority over this one.

    Unless the interpreter process now turns out to be bound by a single CPU core, I'd not think further about threading.
Re: Threads in Perl: Just leaky?
by weismat (Friar) on Oct 05, 2007 at 21:05 UTC
    From my POV your main problem is that you create a thread for every file; this is a language-independent issue. I would suggest that your worker threads take filenames from a shared queue and process them until the queue is empty. That way you pay the thread-creation overhead only 10 times, not 20,000. Check Thread::Queue for the shared queue. I use threads heavily in real-time programs and have not seen any memory problems related to threads.
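    In outline (dequeue_nb is Thread::Queue's non-blocking dequeue; it returns undef once the queue is empty, which is safe here because every filename is enqueued before the workers start):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $q = Thread::Queue->new;
        $q->enqueue(glob '*.dat');   # load all 20000+ filenames up front

        my @pool = map {
            threads->create(sub {
                while (defined(my $file = $q->dequeue_nb)) {
                    # parse $file here
                }
            });
        } 1 .. 10;
        $_->join for @pool;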
Re: Threads in Perl: Just leaky?
by MarkusLaker (Beadle) on Oct 07, 2007 at 11:00 UTC

    Creating and deleting all those threads will flay the heap. It's possible that you're experiencing heap fragmentation, in which the heap has lots of free blocks but none quite large enough to create a new thread. Moving to a boss-worker model, in which you create a pool of threads at startup and then distribute work to them round-robin, should help.

    Whether threading helps performance will partly depend on your storage hardware. I happen to work for a maker of high-end file servers, each handling hundreds of terabytes of storage. If you send ten concurrent requests to one of those beasties, all those NFS or CIFS latencies will be handled in parallel, and the chances are that you'll be seeking on ten separate disks at once. That means you'll get nearly ten times the performance (and the client will be the limiting factor). OTOH, if the files are stored locally and there's only one direct-attached disk that's struggling to cope, multi-threading will help a little (especially if the OS is clever enough to do elevator seeking on the disk), but don't expect wonders.

    Finally, consider moving to fork rather than Perl threads. A while ago, I wrote a Linux-based Telnet proxy -- Telnet in, Telnet out. (It does logging, connection-sharing and a few other things, but proxying is the essence of it.) At startup, or when you add new ports at runtime, it forks two processes per server port; there are typically between fifteen and forty server ports per proxy. The processes communicate with each other using socket pairs. The proxy runs for months at a time without any apparent memory leaks or performance problems. I can recommend that approach, if it fits the problem you're trying to solve.
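    If the results can go somewhere a child process can write to (per-child files, a database, a pipe back to the parent -- children cannot update the parent's in-memory hash), a fork-based version can be as small as this sketch, where each of 10 children takes every 10th file:

        use strict;
        use warnings;

        my @files   = glob '*.dat';   # stand-in for the real file list
        my $workers = 10;

        for my $w (0 .. $workers - 1) {
            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            next if $pid;             # parent: keep forking
            for my $i (grep { $_ % $workers == $w } 0 .. $#files) {
                # parse $files[$i]; write results to a per-child output file
            }
            exit 0;
        }
        1 while waitpid(-1, 0) > 0;   # parent reaps all children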

Re: Threads in Perl: Just leaky?
by Anonymous Monk on Oct 05, 2007 at 08:18 UTC
    Please indent your code properly. It will help you and it will help us help you.
