WHY copying does happen (fork)

by Anonymous Monk
on May 05, 2008 at 16:50 UTC ( [id://684680]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I have a large dataset, which I want to access in my forks(). I only need to read the data so to save memory I would like to avoid copyOnWrite here.
While I get the "big picture" of copyOnWrite, I still have some places in my code, where copying does take place and I just don't understand why. I am using Linux::Smaps and use some code similar to the example given here, mainly looking at the shared_dirty and private_dirty values.

Example: I have a (hopefully shared) array with pointers to (hopefully shared) data arrays. When I do for example

foreach my $arrayPointer (@$pointerList) { };

the shared_dirty value decreases and the private_dirty value increases a lot, sometimes by several hundred kilobytes. But shouldn't it just read?
I asked google for help and searched here, but somehow I don't find anything to answer my big questions:
- WHY does copying happen
- WHAT can I do to avoid it
- do I even look at the correct variables or should I check other ones?
- is there a kind of "copyOnWrite for dummies" document somewhere on the net, if possible perl related?
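For checking the same counters without an extra module, here is a minimal sketch that totals a field from the text of /proc/PID/smaps. It assumes a Linux kernel that exposes Shared_Dirty and Private_Dirty lines in that file; the helper names smaps_total and read_own_smaps are hypothetical, not part of Linux::Smaps.

```perl
use strict;
use warnings;

# Hypothetical helper: total one smaps counter (in kB) across all mappings.
sub smaps_total {
    my ($smaps_text, $field) = @_;
    my $total = 0;
    $total += $1 while $smaps_text =~ /^\Q$field\E:\s+(\d+)\s+kB/mg;
    return $total;
}

# Hypothetical helper: slurp this process's smaps file (Linux only).
# Returns an empty string if the file cannot be read.
sub read_own_smaps {
    open my $fh, '<', "/proc/$$/smaps" or return '';
    local $/;
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Made-up sample text in the smaps layout, so the sketch runs anywhere.
my $sample = "Shared_Dirty:         12 kB\n"
           . "Private_Dirty:        40 kB\n"
           . "Shared_Dirty:          4 kB\n";

printf "sample shared dirty:  %d kB\n", smaps_total($sample, 'Shared_Dirty');
printf "sample private dirty: %d kB\n", smaps_total($sample, 'Private_Dirty');
printf "own private dirty:    %d kB\n", smaps_total(read_own_smaps(), 'Private_Dirty');
```

Calling smaps_total on read_own_smaps() in a child before and after the loop in question should show Shared_Dirty shrinking and Private_Dirty growing when pages get copied.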

thanks ahead for sharing your wisdom :)

Replies are listed 'Best First'.
Re: WHY copying does happen (fork)
by BrowserUk (Patriarch) on May 05, 2008 at 19:43 UTC

    One of the common causes of this in Perl is if you've read your shared numeric data in from a file.

    When you assign that data to the scalars in the array, the data is still stored as strings: '123'.

    But then you come to use those values as a part of a numeric expression, and perl converts the string stored in the PV slot of the scalar and assigns the binary numeric value to the IV or NV slot.

    Bang! You just made a non-mutating reference to a single shared value and caused a 4k page to be copied. Iterate your entire array summing the numbers and you'll cause the whole array, plus everything else on each page that contains any of your array's scalars, to be copied also. A prime example of halo slippage on the "threads are spelt f-o-r-k" holy COW.

    A practical tip: If your shared arrays are numeric and read from files or a DB, add zero to them as you assign them:

    my @sharedArray = map 0+$_, split ' ', $lineOfData;;

    Not only will you not cause COW when doing math with them after forking, the array will be smaller to boot. Adding zero forces the conversion of the string you read into a binary numeric before it is assigned to the SV, which means no PV will be allocated and you save space. And as they are already numeric, using them in a numeric context won't have to convert them and so no mutations of the SV and no COW.
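The slot behaviour described above can be observed with the core B module. This is a minimal sketch, not a memory measurement: the helper slots is hypothetical, and it only reports which value slots a scalar currently has filled.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report which value slots of a scalar are populated.
sub slots {
    my $flags = B::svref_2object(\$_[0])->FLAGS;
    my @filled;
    push @filled, 'PV' if $flags & B::SVp_POK();
    push @filled, 'IV' if $flags & B::SVp_IOK();
    return join ',', @filled;
}

my $from_file = '123';                  # as read from a file: string only
my $before    = slots($from_file);

my $sum       = $from_file + 1;         # numeric use caches the integer in the SV
my $after     = slots($from_file);

my $line      = '456';
my $converted = 0 + $line;              # converted before being stored
my $stored    = slots($converted);

print "string scalar before math: $before\n";
print "string scalar after math:  $after\n";
print "scalar stored via 0+:      $stored\n";
```

The second line of output is the write that triggers the page copy in a forked child; the scalar stored via 0+ never needs it.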

    Of course, that only holds true until you use them in a string context. If you need to print them out to another file, or the terminal, use printf instead of print and interpolation.

    my @a = 1 .. 1e6;;                            ## takes 62 MB.
    printf "the number is: %d\n", $_ for @a;;     ## causes no memory growth
    print "the number is: $_\n" for @a;;          ## causes the memory to grow to 110MB.

    Use interpolation on shared data and the growth would be far higher because unless you are very careful in how you populate the original array, the scalars it consists of will occupy space in 4k pages shared with other data, and they'll be copied also.

    Do it in 2 or more forked children and they'll all get their own copies. 100MB of shared numeric data and 5 forked children and you can see the total memory requirement blossom to well over 1GB just cos you interpolated the numbers.
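The difference between the %d format and interpolation can also be seen at the level of a single scalar. A minimal sketch using the core B module follows; the helper slots is hypothetical and only inspects flags, it does not measure memory.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report which value slots of a scalar are populated.
sub slots {
    my $flags = B::svref_2object(\$_[0])->FLAGS;
    my @filled;
    push @filled, 'PV' if $flags & B::SVp_POK();
    push @filled, 'IV' if $flags & B::SVp_IOK();
    return join ',', @filled;
}

my $n = 42;
my $fresh = slots($n);

my $formatted    = sprintf "the number is: %d", $n;   # reads the integer directly
my $after_format = slots($n);

my $interpolated = "the number is: $n";               # caches a string form in $n
my $after_interp = slots($n);

print "fresh:               $fresh\n";
print "after sprintf %d:    $after_format\n";
print "after interpolation: $after_interp\n";
```

Only the interpolation step adds a string slot to the scalar, which is the write that un-shares its page after a fork.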


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks, BrowserUk. An extremely clear explanation of something I had no idea I needed to know... until the explanation.

      :-)

      -Bib

Re: WHY copying does happen (fork)
by Corion (Patriarch) on May 05, 2008 at 16:53 UTC

    Perl uses reference counting for variables, and hence even reading a variable can mean increasing the reference count, reading the value, and then decreasing the reference count again. So even a read can cause a memory write in Perl. You don't show much code, so I can't tell whether that's the case with your code, but I think that's a probable explanation.

      Could you explain why reading should increase the refcount?
      use Devel::Peek;
      my $x = 3;
      Dump $x;
      fork;
      Dump $x;
      __END__
      SV = IV(0x90e54bc) at 0x90ca648
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,IOK,pIOK)
        IV = 3
      SV = IV(0x90e54bc) at 0x90ca648
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,IOK,pIOK)
        IV = 3
      SV = IV(0x90e54bc) at 0x90ca648
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,IOK,pIOK)
        IV = 3
      The refcount isn't changed, even though a variable is read from a forked process. And why should it? I see no need for that.

      The matter is different when variables go out of scope: then their refcount is reduced, thus possibly modifying it before deallocating.

      Update It's probably this you meant:

      use Devel::Peek;
      my $x = 3;
      Dump $x;
      my $y = \$x;
      Dump $x;
      __END__
      SV = IV(0x97634bc) at 0x9748648
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,IOK,pIOK)
        IV = 3
      SV = IV(0x97634bc) at 0x9748648
        REFCNT = 2
        FLAGS = (PADBUSY,PADMY,IOK,pIOK)
        IV = 3

        If you have a complex data structure and iterate over it, I expect refcounting to increase and then decrease the refcount:

        C:\>perl -MDevel::Peek -e "$x=[{foo=>1}]; Dump $x; $a=$x->[0]; Dump ($a); "
        SV = RV(0x1aa4ffc) at 0x1a93898
          REFCNT = 1
          FLAGS = (ROK)
          RV = 0x1a44bac
          SV = PVAV(0x1a45f9c) at 0x1a44bac
            REFCNT = 1
            FLAGS = ()
            IV = 0
            NV = 0
            ARRAY = 0x1a4baac
            FILL = 0
            MAX = 0
            ARYLEN = 0x0
            FLAGS = (REAL)
            Elt No. 0
            SV = RV(0x1aa4ff4) at 0x1a44bb8
              REFCNT = 1
              FLAGS = (ROK)
              RV = 0x1a44a98
              SV = PVHV(0x1a92fec) at 0x1a44a98
                REFCNT = 1 # <- only one reference
                FLAGS = (SHAREKEYS)
                IV = 1
                NV = 0
                ARRAY = 0x1a4b9b4 (0:7, 1:1)
                hash quality = 100.0%
                KEYS = 1
                FILL = 1
                MAX = 7
                RITER = -1
                EITER = 0x0
                Elt "foo" HASH = 0x238678dd
                SV = IV(0x1a989f0) at 0x1a44b88
                  REFCNT = 1
                  FLAGS = (IOK,pIOK)
                  IV = 1
        SV = RV(0x1aa4fd4) at 0x1a938e0
          REFCNT = 1
          FLAGS = (ROK)
          RV = 0x1a44a98
          SV = PVHV(0x1a92fec) at 0x1a44a98
            REFCNT = 2 # <- now two references
            FLAGS = (SHAREKEYS)
            IV = 1
            NV = 0
            ARRAY = 0x1a4b9b4 (0:7, 1:1)
            hash quality = 100.0%
            KEYS = 1
            FILL = 1
            MAX = 7
            RITER = -1
            EITER = 0x0
            Elt "foo" HASH = 0x238678dd
            SV = IV(0x1a989f0) at 0x1a44b88
              REFCNT = 1
              FLAGS = (IOK,pIOK)
              IV = 1

        I guess you can do clever things with aliases to avoid refcounting (as the stack currently is not refcounted), but it's easy to miss just one of these.
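The refcount changes discussed in this subthread can be checked without reading full dumps. The following is a minimal sketch using the core B module; the variable names are illustrative. It looks at two cases: copying a reference out of the structure, and aliasing an element with foreach as in the original question.

```perl
use strict;
use warnings;
use B ();

my $data = [ { foo => 1 }, { bar => 2 } ];

# Case 1: copying a reference adds one reference to the hash it points at.
my $hash_before = B::svref_2object($data->[0])->REFCNT;
my $copy        = $data->[0];
my $hash_after  = B::svref_2object($data->[0])->REFCNT;

# Case 2: foreach aliases each element, which holds a reference to it.
# Taking \... creates a temporary reference, so both readings below are
# inflated by the same amount; only the difference is meaningful.
my $elem_outside = B::svref_2object(\$data->[0])->REFCNT;
my $elem_inside;
for my $elem (@$data) {
    $elem_inside = B::svref_2object(\$elem)->REFCNT;
    last;
}

print "hash refcount:    $hash_before -> $hash_after\n";
print "element refcount: $elem_outside outside loop, $elem_inside inside loop\n";
```

If the element count is higher inside the loop, then even an empty foreach body writes to every element it visits, which would account for the private_dirty growth in the question.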

Re: WHY copying does happen (fork)
by almut (Canon) on May 05, 2008 at 17:01 UTC
    - WHAT can I do to avoid it

    In addition to what Corion said, you should also avoid implicit type conversions (e.g. number to string), or - if they can't be avoided - force them before doing the fork...

    (also see fork(): where does copy happen?)

Re: WHY copying does happen (fork) (pages)
by tye (Sage) on May 05, 2008 at 17:48 UTC

    "copy on write" only works at a granularity of a "page" of memory. A complex data structure is likely allocated "all over" the heap so even if you don't change any of it, you likely change other things on the pages that the parts of it ended up on. Add reference counting to that and you are very unlikely to keep much of any part of the heap shared (which is where Perl puts nearly everything). Allocate a huge string and don't change it and most of it (except some of each end) will stay shared.
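One way to act on the "huge string" observation, as a sketch only: pack the numbers into a single string before forking and decode individual values on demand. This assumes the data fits signed 32-bit integers; the helper value_at and the sample values are hypothetical, and whether it saves memory in a given application would need to be measured.

```perl
use strict;
use warnings;

my @values = (10, -20, 30, 40);         # stand-in for the real dataset
my $packed = pack 'l*', @values;        # one contiguous buffer, 4 bytes per value
@values = ();                           # drop the per-element scalars before fork

# Hypothetical accessor: decode one value from the packed buffer.
# It copies 4 bytes out of $packed rather than modifying it.
sub value_at {
    my ($index) = @_;
    return unpack 'l', substr($packed, $index * 4, 4);
}

my $count = length($packed) / 4;
my $third = value_at(2);
my $sum   = 0;
$sum += value_at($_) for 0 .. $count - 1;

print "count=$count third=$third sum=$sum\n";
```

The trade-off is CPU time for each unpack call in exchange for having no per-element scalars whose flags or reference counts could be written to.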

    - tye        

Re: WHY copying does happen (fork)
by pc88mxer (Vicar) on May 05, 2008 at 19:56 UTC
    Given that this is the second time in the last few months that this issue has come up, it might be a good idea to bring it to the attention of the perl5porters mailing list. Someone with in-depth knowledge of the internals might realize there is a simple workaround.
Re: WHY copying does happen (fork)
by perrin (Chancellor) on May 05, 2008 at 17:03 UTC
    Try looking at the variables before and after with Devel::Peek. That may tell you what changed.
Re: WHY copying does happen (fork)
by pc88mxer (Vicar) on May 05, 2008 at 18:25 UTC
    How about using a module like DB_File, BerkeleyDB or CDB to share the data between processes? You'd have to redevelop your interface to the data, but it would be a memory efficient way to share it.
Re: WHY copying does happen (fork)
by sundialsvc4 (Abbot) on May 08, 2008 at 15:56 UTC

    What you'd like to have, and I don't know if you can have, is either several true threads, all of which therefore operate in one memory space, or a data structure that exists in a single shared memory space that every process truly attaches to. I don't know how you do that in Perl.

    Basically, what you're trying to achieve here is a read-only shared data-store, and I have no doubt that it is the reference-counting that's causing that multitude of writes. I wish that I understood the whys-and-wherefores of the source code to tell you just how you could achieve “manageable, clean source-code” while avoiding those to-you-useless and unwanted writes.

    I frankly don't share the notion that you should use a tied-hash here, because if virtual-memory works for you it sure does have lots of benefits. But not when the VM manager of the operating system thinks it's volatile because you're whacking reference-counters... Those {how's the system to know they're...} unnecessary COW-faults are a serious recipe for thrashing, as no doubt you've seen.

      Yeah, I would like to have this data structure and be able to access it from all forks / threads. In theory threads would be nice, but sadly the way they are currently implemented in perl (shared data are not really shared, everything is damn slow) makes them close to useless for us :(
      I am afraid I have to accept that these (for us unnecessary) writes do happen and that we cannot really avoid them. I am currently thinking of how we could rewrite the whole application to get the needed data from somewhere else (some suggested an in-memory database) without taking too large a performance hit.
Re: WHY copying does happen (fork)
by stiller (Friar) on May 07, 2008 at 05:27 UTC
    Do you read the data in before or after forking? I haven't used shared memory directly, so I might be way off here, but I thought that you would have to fork first, then build up the shared data between the processes?
      I read the data in BEFORE forking.
      When you fork, each child process is a kind of replica of the original. Via copyOnWrite the original data is shared. If one fork writes, it gets its own copy of the affected data and makes its modifications there, so no other fork can see the modification. Sharing data between different forks is expensive and slow.
      My idea was that since I only read, the operating system shouldn't need to create copies (slow and requires memory), but after reading the answers here it seems that even for read-only access perl does enough writing to the data to make the OS start copying :(
      That's not nice, but I guess I have to live with it and find other solutions like the in-memory databases mentioned...
