http://qs321.pair.com?node_id=1086846


in reply to Design advice: Classic boss/worker program memory consumption

A typical solution is to have the code for the boss and the worker in different executables. When spawning a worker, the boss does a fork, immediately followed by an exec, which is much more memory friendly (on modern Linux/Unix systems, copy-on-write means you need barely extra memory for a fork when neither of the programs modify the data).Then have some kind of IPC to transfer the work unit to the worker.

That way the worker only needs modest amounts of memory, and spawning many of them isn't that expensive. No intermediate process either.

Replies are listed 'Best First'.
Re^2: Design advice: Classic boss/worker program memory consumption
by shadrack (Acolyte) on May 20, 2014 at 20:01 UTC
    When spawning a worker, the boss does a fork, immediately followed by an exec, which is much more memory friendly (on modern Linux/Unix systems, copy-on-write means you need barely extra memory for a fork when neither of the programs modify the data).

    copy-on-write does indeed make a fork more efficient, but you lose all that efficiency the minute you do an exec because the exec'd process starts over from scratch, loading its own separate copy of the perl interpreter, its own separate copy of the script, all the modules, etc. -- all taking up memory which would otherwise have been shared with the parent. This makes it LESS memory-friendly than just forking.

      You also lose all that efficiency when you start to read a lot of data structures in perl, because reading data ca n increase the reference counters, thus a write actually happens in the background, causing the operating system to copy the pages.

      So, don't assume stuff is memory unfriendly. Measure it. In your context, where it matters.

      Also, starting a new perl interpreter takes about 0.017s "cold" on my machine, and 0.002s when cached. Compared to a 1 minute run time of a worker thread, that's like 0.03% (cold) or 0.003% (warm) - nothing to worry about.

Re^2: Design advice: Classic boss/worker program memory consumption
by Laurent_R (Canon) on May 20, 2014 at 21:30 UTC
    The memory size is only part of the problem. The problem with spawning too many children also has to do with context switches.

    I am regularly working with seven separate databases, each having eight sub-databases (and huge amount of data). So, in principle, I could launch 56 parallel processes for extracting data from these 56 data sources. Most of the time, I don't use forks or threads, but simply launch parallel background processes under the shell. And the processes that I am referring to are sometimes in Perl, and sometimes in a variety of other languages (including a proprietary equivalent of PL/SQL), so that using the OS to fork background processes is often the easiest route: the shell and the OS manage the parallel tasks, the program manages the functional/business requirements.

    Our servers have usually 8 CPUs. We have been doing this for many years and have tried a number of options and configurations, and we have found that the optimal number of processes running in parallel is usually between 8 and 16, depending on the specifics of the individual programs (i.e. some are faster with 8 processes, some better with 12 and some better with 16, depending on what exactly they are doing and how).

    If we use less than 8 processes, we have an obvious under-utilization of the hardware; if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16 processes, and we strongly believe that this is due to context switches and memory usage. In some cases (when the extraction need heavy DB sorting, for example), the overall process even fails for lack of memory if we have too many processes running in parallel. But in most of the cases, it really seems to be linked to having too many processes running for the number CPUs available, leading to intensive context switches.

    So what we are doing is to put the 56 processes into a waiting queue, whose role is just to to keep optimal the actual number of processes running in parallel (8 to 16 depending on the program). This is at least the best solution we've found so far. With about 15 years multiplied by 5 or 6 persons of cumulated experience on the subject. Now, if anyone has a better idea, I would gladly take it up and try it out.

      if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16 processes, and we strongly believe that this is due to context switches and memory usage.

      Having to swap pages out and back in surely can make things run much, much longer. 56 processes talking to databases isn't really going to add enough overhead from context switching to cause the total run-time to be "much longer", IME. To get context switching to be even a moderate source of overhead, you have to force it to happen extra often by passing tiny messages back and forth way more frequently than your OS's time slice.

      There are also other resources that can have a significant impact on performance if you try to over-parallelize. Different forms of caching (file system, database, L1, L2) can become much less effective if you have too many simultaneously competing uses for them (having to swap out/in pages of VM is just a specific instance of this general problem).

      To a lesser extent, you can also lose performance for disk I/O by having reads be done in a less-sequential pattern. But I bet that is mostly "in the noise" compared to reduction in cache efficiency leading to things having to be read more than once each.

      Poor lock granularity can also lead to greatly reduced performance when you over-parallelize, but I doubt that applies in your situation.

      - tye        

        context switching is cheap

        Sorry Tye, but that is twaddle. Eiher guesswork, or naiveté.

        Firstly, which type of "context switch" are you mis-describing?

        • Thread-context.
        • Process-context.

        The cost of a hardware-based, thread-only context switch runs from 10's to over 1000 microseconds. Add in the need to invalidate and refill L1/L2/L3 caches, and the total time to re-consitute a thread-context back the point that real work can proceed can actually take longer than the maximum timeslice some (flavours of some) OSs will allocate to a thread.

        And don't go blaming threads; switching process-contexts is a (collection of) software operations and is more expensive still.

        The cost of context switch may come from several aspects. The processor registers need to be saved and restored, the OS kernel code (scheduler) must execute, the TLB entries need to be reloaded, and processor pipeline must be flushed. These costs are directly associated with almost every context switch in a multitasking system. These are the direct costs.

        In addition, context switch leads to cache sharing between multiple processes, which may result in performance degradation. This cost varies for different workloads with different memory access behaviors and for different architectures. These are cache interference costs or indirect costs of a context switch


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Having to swap pages out and back in surely can make things run much, much longer. 56 processes talking to databases isn't really going to add enough overhead from context switching to cause the total run-time to be "much longer", IME.

        Well I guess it depends on our definitions of what "much longer" mean. In our case, it can be up to twice longer, say four hours instead of 2. For us it is sometimes the difference between usable and not usable program.