To Fork or Not to Fork. Tis a question for PerlMonks

by pimperator (Acolyte)
on Jul 01, 2014 at 01:15 UTC ( [id://1091787] : perlquestion )

pimperator has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a question about forking processes using the Parallel::ForkManager module. I'm running a script that parses large files; there are about 10,000 of these files. The initial script I wrote forks up to 15 child processes at a time, each of which parses one of the 10,000 files. It took around 5 days for the code to finish (the person running it pauses it sometimes), processing one file every 1-2 minutes. I'm thinking that because ForkManager only creates new processes and does not 'really thread', the processing time would be the same if I chose not to fork. Is this true? Basically, would it take the same amount of time to parse 10,000 files one by one as it does to parse 15 of them at a time? I'm asking this rather than testing it because I'm writing code for someone else to run on a computer that I cannot access due to sensitive information. So... yeah. Tusen takk (many thanks), Monks.
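
For reference, here is a minimal sketch of the kind of per-file forking loop described above, using Parallel::ForkManager. The file glob and the parse_file() routine are placeholders, not the actual code (which wasn't posted):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @files = glob('data/*.txt');              # placeholder file list
    my $pm    = Parallel::ForkManager->new(15);  # at most 15 children at once

    for my $file (@files) {
        $pm->start and next;   # parent records the child and moves on
        parse_file($file);     # child does the expensive parsing
        $pm->finish;           # child exits here
    }
    $pm->wait_all_children;    # parent waits for the last batch to finish

    # Placeholder for the real per-file work.
    sub parse_file {
        my ($file) = @_;
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            # ... parse $line ...
        }
        close $fh;
    }

Whether a loop like this actually beats a plain serial loop is exactly the question at hand; it depends on where the bottleneck is, as the replies below discuss.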

Replies are listed 'Best First'.
Re: To Fork or Not to Fork (bottle necks)
by tye (Sage) on Jul 01, 2014 at 04:17 UTC

    Forking certainly can make processing a large number of large files go much faster. We have a system that does exactly that, and forking allows it to run many times faster. But then, our processing of files is mostly CPU-bound, as we are transcoding the files, and the system has 32 cores precisely because of this.

    We just finished benchmarks on a revamp of this and it is about 4x faster than it used to be (despite it previously forking more workers than there are CPU cores). The old process worked pretty much exactly like Parallel::ForkManager. The new strategy pre-forks the same number of workers and just continuously feeds filenames to them over a simple pipe.

    There are several advantages to the new approach. The children are forked off the parent before it has built up the list of files to be processed, which will often be a huge list, so there are far fewer copy-on-write pages to eventually be copied. The children live a very long time now, so there is less overhead from fork()ing (once per worker instead of once per file). The above two features also mean that it makes sense for the children to be the ones to talk to the database, which is probably the biggest "win". It also significantly simplified the code.
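
    A rough sketch of that pre-forked arrangement (the worker count, file discovery, and process_file() routine are all stand-ins, not the actual production code) might look like this:

        use strict;
        use warnings;
        use IO::Handle;

        my $workers = 15;    # assumed worker count; match it to your cores
        my @writers;

        # Fork the workers *before* building the (possibly huge) file list,
        # so the children stay small. Each worker reads filenames from its
        # own pipe until that pipe is closed.
        for (1 .. $workers) {
            pipe(my $r, my $w) or die "pipe: $!";
            my $pid = fork() // die "fork: $!";
            if ($pid == 0) {                 # child
                close $w;
                while (my $file = <$r>) {
                    chomp $file;
                    process_file($file);
                }
                exit 0;                      # pipe closed => no more work
            }
            close $r;                        # parent keeps the write end
            $w->autoflush(1);
            push @writers, $w;
        }

        # Only now does the parent enumerate the files and deal them out
        # round-robin to the long-lived workers.
        my @files = glob('data/*.txt');      # placeholder file list
        my $i = 0;
        print { $writers[ $i++ % $workers ] } "$_\n" for @files;

        close $_ for @writers;               # EOF tells the children to exit
        wait() for 1 .. $workers;            # reap them

        sub process_file {
            my ($file) = @_;
            # Placeholder for the real per-file work (including any database
            # updates, which in this setup the workers handle themselves).
            print "[$$] processing $file\n";
        }

    The key point is that each worker's lifetime spans many files, and the pipe doubles as a simple work queue with built-in backpressure.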

    If your processing of files is mostly I/O bound, then doing a bunch of them in parallel could actually be slower than doing them in serial. Though, I would expect that your processing of one file isn't perfectly I/O bound and having at least two running will provide some speed-up as often one can use the CPU while the other is waiting for I/O.

    Once you have enough processes that you are maxing out the available throughput of either your CPU cores or your I/O subsystem, then adding more processes will just add overhead.

    - tye        

      And a related question: Design advice: Classic boss/worker program memory consumption.

      Did you also consider/bench a threaded version?

      Might there be a practical way to manage multiple malloc areas in one perl process? A separate arena (mmap) per package? This could improve locality, and also segment the data so that CoW benefits are subdivided.

        No, we didn't consider using threads (nor what Perl 5 calls "threads"). If using threads offered any performance benefits for our use case, the benefits would be truly trivial.

        The cost of using threads in terms of complexity would not be trivial. In some cases, you can use Perl threads with the Perl code mostly untouched and the added complexity mostly just hides inside of the perl executable, though it will still add operational complexity when it comes to troubleshooting and similar tasks. But that complexity inside the perl executable very often doesn't stay completely hidden (go read the p5p thread on why Perl threads have been "deprecated", if you aren't familiar with the typical experience of trying to use Perl threads).

        But I don't think this is even one of those cases. Since the workers spawn other executables to do parts of the transcoding, spawning jobs involves waiting for SIGCHLD, and mixing signals and threads is often pointed out as a bad idea, I suspect that the Perl code would actually have to get significantly more complex.

        So I didn't even consider adding significant complexity in at least 2 or 3 layers for a possible but at-most-trivial performance gain and maybe a (likely trivial) performance decrease.

        - tye        

      Thank you for the detailed reply. To clarify, my code sends a command to the system to convert a large file into a larger file using another program. Because there is no I/O, it would be advantageous to use Parallel::ForkManager.

      But when I open and read each file, it's better to do it serially.
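
      A rough sketch of that conversion loop, with the external command, its options, and the file names all hypothetical placeholders:

          use strict;
          use warnings;
          use Parallel::ForkManager;

          my @files = glob('input/*.dat');            # placeholder file list
          my $pm    = Parallel::ForkManager->new(15);

          for my $file (@files) {
              $pm->start and next;
              # Child: hand the heavy lifting to the external converter.
              # 'convert_tool' and its options are made up for illustration.
              system('convert_tool', '--in', $file, '--out', "$file.converted") == 0
                  or warn "convert_tool failed on $file: exit status $?";
              $pm->finish;
          }
          $pm->wait_all_children;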

        Because there is no I/O

        It is true that your Perl script isn't doing the I/O, but it is very much not true that "there is no I/O".

        Whether it is Perl or some other program that does the I/O (or the computing) has little bearing on the performance impact of having Perl fork() so that you can have more than one instance running at once.

        - tye        

Re: To Fork or Not to Fork. Tis a question for PerlMonks
by salva (Canon) on Jul 01, 2014 at 07:31 UTC
    I'm asking this and not testing it because I'm writing a code for someone else to run on a computer that I cannot access due to sensitive information.

    Then you are unlucky, because this is one of those things that can only be known by trying!

    Not showing us your code doesn't help either.

Re: To Fork or Not to Fork. Tis a question for PerlMonks
by roboticus (Chancellor) on Jul 01, 2014 at 13:13 UTC

    pimperator:

    Programs not waiting for user input are dominated by CPU or I/O. If your processing is CPU bound, then splitting the work among the available processors can help. Similarly, if your processing is I/O bound, splitting your work over multiple I/O devices can speed things up significantly. However, if your task is I/O bound and you can't split the work over multiple I/O devices, then splitting the work among several processes can make things go even slower.

    Keep in mind that I/O devices include the network as well as disk drives. I've had a couple jobs where file processing was limited by network bandwidth (the disk drives were in a SAN), so we overcame the bottleneck by splitting the job over several computers.

    I suggest you first find out what your bottleneck is, and then think about an appropriate strategy to split the work. In the (very rare) case that the non-dominated resource is still heavily used (e.g. you're using 100% of the I/O and 90% of the CPU), then splitting the work up may not gain you much, as you'll almost immediately hit the next bottleneck. Again, this isn't a common case.
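
    One quick (and admittedly crude) way to check is to run the per-file work on a representative file and compare CPU time to wall-clock time. The sample file name and parse_file() routine below are placeholders:

        use strict;
        use warnings;
        use Time::HiRes qw(time);

        my $file = 'sample_input.txt';    # a representative input file

        my $wall0     = time();
        my ($u0, $s0) = (times())[0, 1];

        parse_file($file);

        my $wall      = time() - $wall0;
        my ($u1, $s1) = (times())[0, 1];
        my $cpu       = ($u1 - $u0) + ($s1 - $s0);

        # Near 100% => mostly CPU-bound; much lower => mostly waiting on I/O.
        printf "wall %.2fs, cpu %.2fs (%.0f%% of the time on the CPU)\n",
               $wall, $cpu, $wall ? 100 * $cpu / $wall : 0;

        # Placeholder for the real per-file work.
        sub parse_file {
            my ($f) = @_;
            open my $fh, '<', $f or die "Can't open $f: $!";
            1 while <$fh>;
            close $fh;
        }

    Numbers well below 100% suggest the process spends most of its time waiting on the disk, in which case piling on more processes mostly adds contention.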

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thank you for the reply. I imagine the computer's bottleneck is the I/O, so I'll only parallelize the CPU-bound processing.

Re: To Fork or Not to Fork. Tis a question for PerlMonks
by BrowserUk (Patriarch) on Jul 01, 2014 at 01:59 UTC
    The code processed 1 file every 1-2 minutes

    How big are the files?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: To Fork or Not to Fork. Tis a question for PerlMonks
by Preceptor (Deacon) on Jul 01, 2014 at 10:12 UTC

    It depends on what the forked code does. CPUs and memory are, as a rule, far faster than disk I/O. If there's a lot of processing to be done, forking (or threading) means grabbing more processors. If you're I/O bound, then it probably won't help much - any gains are offset by contention.