http://qs321.pair.com?node_id=741776


in reply to How do you parallelize STDIN for large file processing?

The very easiest way to do this would be to divide your input file into N pieces and process each piece in parallel by starting your script on each one. Have each process write to its own output file and stitch the files together at the end. If you start one process per CPU and the black magic you do is CPU-bound, you could get something close to a linear speedup.
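
A rough sketch of that approach, assuming a hypothetical ./blackmagic filter that reads lines on STDIN and writes lines to STDOUT (swap in your real processing), could look something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $n_cpus = 4;    # assumption: one piece (and one worker) per CPU
    my $input  = shift or die "usage: $0 bigfile\n";

    # First pass: count lines so each piece gets an even share.
    open my $in, '<', $input or die "open $input: $!";
    my $total = 0;
    $total++ while <$in>;
    seek $in, 0, 0 or die "seek: $!";
    my $per_piece = int($total / $n_cpus) + 1;

    # Second pass: carve the input into contiguous pieces.
    my @pieces;
    for my $i (0 .. $n_cpus - 1) {
        my $piece = "$input.part$i";
        open my $out, '>', $piece or die "open $piece: $!";
        my $n = 0;
        while ($n < $per_piece) {
            my $line = <$in>;
            last unless defined $line;
            print {$out} $line;
            $n++;
        }
        close $out;
        push @pieces, $piece;
    }
    close $in;

    # One worker per piece, each writing its own output file.
    my @kids;
    for my $piece (@pieces) {
        defined(my $pid = fork) or die "fork: $!";
        if ($pid == 0) {
            exec "./blackmagic < $piece > $piece.out" or die "exec: $!";
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;

    # Stitch the outputs back together in the original order.
    open my $final, '>', "$input.out" or die "open $input.out: $!";
    for my $piece (@pieces) {
        open my $part, '<', "$piece.out" or die "open $piece.out: $!";
        print {$final} $_ while <$part>;
        close $part;
    }
    close $final;

The extra counting pass is wasteful on a 4.1GB file; splitting by byte offsets (seeking to a line boundary) would avoid it, at the cost of fiddlier code.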

That of course means you can't take data strictly from STDIN, but really, that's a silly way to process 4.1GB of data!

If you really, really need to read it from a stream and output the results in order, then you'll need a parent that reads the stream, forks kids, and collects their results in buffers so it can write them out in order. Start with Parallel::ForkManager, which will handle doling out the work, then mix in some IO::Pipe and IO::Select for collecting the results. Be sure you divide the work into sizable chunks; forking a new child for each line isn't going to help very much!
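
Here's a minimal sketch of the buffered, in-order idea. Rather than hand-rolled IO::Pipe/IO::Select plumbing, it leans on Parallel::ForkManager's own data passing (finish() with a reference, collected in run_on_finish, available since 0.7.6), which is simpler to get right. process_line() is a hypothetical stand-in for the black magic:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $max_procs  = 4;        # assumption: one worker per CPU
    my $chunk_size = 10_000;   # lines per chunk; tune to taste

    my $pm = Parallel::ForkManager->new($max_procs);

    my %done;          # finished chunks, keyed by chunk number
    my $next_out = 0;  # next chunk number the output stream needs

    # Runs in the parent as each child is reaped: buffer that child's
    # results, then flush every chunk that is now ready, in order.
    $pm->run_on_finish(sub {
        my ($pid, $exit, $chunk_no, $signal, $core, $data) = @_;
        $done{$chunk_no} = $data;
        while (exists $done{$next_out}) {
            print @{ delete $done{$next_out} };
            $next_out++;
        }
    });

    my $chunk_no = 0;
    while (1) {
        my @lines;
        while (@lines < $chunk_size) {
            my $line = <STDIN>;
            last unless defined $line;
            push @lines, $line;
        }
        last unless @lines;

        # Parent keeps reading; the child falls through to the work.
        $pm->start($chunk_no++) and next;

        my @out = map { process_line($_) } @lines;
        $pm->finish(0, \@out);   # ship the results back to the parent
    }
    $pm->wait_all_children;

    sub process_line { my ($line) = @_; return uc $line }  # stand-in

Note that if one chunk is slow, the parent buffers everything that finishes after it, so memory use isn't strictly bounded; that's the price of strict output ordering.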

Now go give it a try and don't come back until you have some code to post!

-sam
