http://qs321.pair.com?node_id=128595

Ruby has a cool feature called copy-on-write which I abbreviate as COW. Perl needs to use this, and I'm going to implement it (as magic, if I can). As an exercise, I'm writing a simple language. A brief introduction is presented here. If you feel like playing with the language, by all means, go ahead, but be aware that it's not implemented as much as this documentation tells. This document is the plan, and I have yet to code it all. The source code isn't documented much at all, but once the language gets the features I wrote it to have, I'll spend time documenting it.

The documentation below is, itself, incomplete. I don't show how the copy-on-write part actually happens, but give me a day or so.

Oh, and if anyone here is good with malloc() and free() in C and can tell me where in my code I should be free()ing things, I'd be very grateful.


NAME

COW - How Copy-On-Write Can Be Implemented


SYNOPSIS

Copy-On-Write (or COW) is a concept that delays the copying of potentially large data (strings, specifically) as long as possible. It can be implemented via ``magic'', and this should be trivial for the Perl programming language.


DESCRIPTION


The Problem

Copy-On-Write (or COW) is a concept that delays the copying of potentially large data (strings, specifically) as long as possible. Here is an example:

  $big = "_" x 10_000;

$big is now 10,000 characters long. Let's say we want to store a substring of that in $small.

  $small = substr $big, 500, 2000;

$small is now 2,000 characters long. That's a lot to copy. :( What if we could alias $small to the 2,000 characters starting at position 500 in $big, and only copy the 2,000 characters when we needed to (when $big changes)?


Aliasing via tie()

That'd be cool. $small would be magical. Here's a demonstration:

  alias($small, \$big, 500, 2000);

  sub alias { tie $_[0], 'SubstringAlias', \$_[0], @_[1,2,3] }

  package SubstringAlias;

  sub TIESCALAR {
    my ($class, $orig, $targ, $pos, $len) = @_;
    bless [ $orig, $targ, $pos, $len ], $class;
  }

  # when we get $small's value
  # we return the substring of the
  # target variable
  sub FETCH {
    my ($self) = @_;
    my ($me, $str_ref, $pos, $len) = @$self;
    return substr $$str_ref, $pos, $len;
  }

  # when we store a value in $small
  # we should stop it from being an alias!
  sub STORE {
    my ($self, $value) = @_;
    untie ${ $self->[0] };
    ${ $self->[0] } = $value;
  }

Now $small is a tied variable. When we access its contents, the object that is tied to $small has its FETCH() method called:

  print $small;  # like (tied $small)->FETCH()
                 # where (tied $small) returns the object


Self-Aware Aliasing

However, what happens when $big changes?! $small would reflect those changes. We don't want that. :( We need a way to make $small get those 2000 characters at the last possible moment -- when $big changes. This ``change'' is called ``writing'', because we're writing some value to $big. We want a ``lazy'' alias, and we can implement this with another tie():

  #!/usr/bin/perl

  $big = "Perl is a wonderful language";
  alias($small, \$big, 5, 15);

  print "<$big>\n";
  print "<$small>\n\n";

  $big = "foo";

  print "\n<$big>\n";
  print "<$small>\n";

  sub alias { tie $_[0], 'SubstringAlias', \$_[0], @_[1,2,3] }

  package SubstringAlias;

  sub TIESCALAR {
    my ($class, $orig, $targ, $pos, $len) = @_;
    my $self = bless [ $orig, $targ, $pos, $len ], $class;

    # make $big a tied variable TOO!
    tie $$targ, 'AliasTarget', $targ, $self;
    return $self;
  }

  sub FETCH {
    print "* SubstringAlias::FETCH()\n";
    my ($self) = @_;
    my ($me, $str_ref, $pos, $len) = @$self;
    return substr $$str_ref, $pos, $len;
  }

  # stop being an alias if we
  # are assigned to
  # NOTE: unties both
  sub STORE {
    print "* SubstringAlias::STORE()\n";
    my ($self, $value) = @_;
    my ($me, $targ) = @$self;
    untie $$me;
    untie $$targ;
    $$me = $value;
  }

  package AliasTarget;

  sub TIESCALAR {
    my ($class, $orig, $copy_to) = @_;
    bless [ $orig, $$orig, $copy_to ], $class;
  }

  # just display our value
  sub FETCH {
    print "* AliasTarget::FETCH()\n";
    my ($self) = @_;
    return $self->[1];
  }

  # now we copy our value
  # to the alias before
  # we change our value
  # NOTE: unties both
  sub STORE {
    print "* AliasTarget::STORE()\n";
    my ($self, $value) = @_;
    my ($me, $str, $target) = @$self;
    untie $$me;
    $target->STORE(substr $str, $target->[2], $target->[3]);
    $$me = $value;
  }

Whew. That program works, and when run, produces the following output:

  * AliasTarget::FETCH()
  <Perl is a wonderful language>
  * SubstringAlias::FETCH()
  * AliasTarget::FETCH()
  <is a wonderful >

  * AliasTarget::STORE()
  * SubstringAlias::STORE()

  <foo>
  <is a wonderful >

The diagnostic print() statements show us what was going on. We see that when we ask for $big's value, we call its FETCH() method, and when we ask for $small's value, we call ITS FETCH() method, which in turn has to call $big's FETCH() method.

When we store the value ``foo'' in $big, we copy the substring of $big to $small before we untie the two of them and leave them as normal strings. That was a bit of work. Wouldn't it be nice if there was a language that did that all transparently?

Let's make one up!

Let's call it COW.


Introduction to the COW Programming Language

COW is a very simple language. It supports the following operations:

  # comments like Perl's

  # assigning a string to a variable
  var = "string";
  var = 'string';

  # printing a variable or a string
  print var;
  print "string";
  print 'string';

  # assigning another variable to a variable
  var = other_var;

  # "cow"ing another variable to a variable
  var = cow other_var;

Those are all pretty self-explanatory, except for the last one. The cow function means ``alias var to other_var until other_var is written to, at which point, copy the contents of other_var to var.`` That's a lot for that function to mean, but that's what it does. This is good, because if we never change other_var, we won't have to worry about the duplicated memory (if we copied other_var's contents to var, we'd have a copy of the string).

How is COW implemented? It's not all too hard. It takes a bit of ``magic'', though. Let's step through a program:

  name = "japhy";

This creates a scalar value (SV) in the program's symbol table. All SVs have an internal structure like so:

  struct _sv {
    list *dependecies;  /* who is aliasing me? */
    MG *magic;          /* what type of magic do I have? */

    int line;           /* what line number was I defined on? */
    int flags;          /* what flags do I have on? */
    int str_len;        /* what is the length of my string value? */
    STR str;            /* the char* representing the string */
  };

Whew. That's C code, if you're not familiar with it. Anyway, safe to say the first line of our program doesn't do much. Here's the internal view of the name variable:

  name -->      dependencies = NULL
                magic = NULL
                line = 1
                flags = 0
                str_len = 5
                str = "japhy"

Ok. Now let's look at the next line of the program:

  alias = cow name;

This creates another variable, alias, and tells it to point to name. This line does a lot of stuff. Let's see what it does to name first:

  name -->      dependencies = SV*(alias)
                magic = NULL
                line = 1
                flags = SVf_COW
                str_len = 5
                str = "japhy"

The dependencies list contains a pointer to alias's SV. Here's what alias looks like:

  alias -->     dependencies = NULL
                magic = {
                  type = MGt_COW
                  ftbl = {
                    get = mg_cow_get
                    set = mg_cow_set
                    len = mg_cow_len
                    clear = mg_cow_clear
                    free = mg_cow_free
                  }
                  object = SV(name)
                }
                line = 2
                flags = SVf_MAGIC
                str_len = 5
                str = SvSTR(SV*(name))

In alias's representation, the SVf_MAGIC flag has been turned on. The type of magic is MGt_COW. The ftbl is a struct of pointers to functions to call for the requested action. Finally, the object field holds the SV the magic functions have access to -- in this case, the place we read from (the SV of name). str is not a copied string, but merely a pointer to name's str field.

Onto the next line:

  print alias;

The print function works something like this:

  void op_print_str (STR *s) {
    printf("%s", s);
  }

  void op_print_sv (SV *sv) {
    /* this will update sv->str if needed */
    if (SvMAGIC(sv)) {
      MG *mg;
      if ((mg = SvMG(sv)) && mg->ftable->get)
        CALL(mg->ftable->get)(sv, mg);
    }
    op_print_str(SvSTR(sv));
  }

What's happening here is SvMAGIC(sv) checks to see if sv has the SVf_MAGIC flag on. If so, we extract the MG structure via SvMG(sv) and store it in mg. If that is true (not NULL) and there is a get function defined for it, that function is called -- the CALL() macro just turns a function pointer into a function:

  #define CALL(fptr) (*fptr)

The get function will probably update the str field of the SV structure. After that's done, we pass the str field, gotten via SvSTR(sv), to the op_print_str() function.


CREDITS

I borrowed a lot of the design of the scalars and magic from Perl. But that's good, since I hope to translate the implementation directly to a Perl magic structure.


AUTHOR

Jeff japhy Pinyan, [mailto://japhy@perlmonk.org].

Download the source at: http://japhy.perlmonk.org/COW/.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re (tilly) 1: Don't have a COW? Get one!
by tilly (Archbishop) on Nov 30, 2001 at 23:44 UTC
    I have some huge worries about this.

    I know why you want it. Ruby uses this (among other things) to support $` etc lazily without bugs. If you don't use the feature, you don't pay for it, and if you use it, you only pay for as much of it as you use. (By contrast Perl's hack with a global flag means that if you use it, you pay like heck.)

    However if you try to use copy on write with magic in Perl, you will get serious bugs from the fact that magic and local have a well-developed distaste for each other. This would give efficient $` etc. But only at the cost of introducing weird bugs involving mixing matches and local variables.

    Beyond that, any language with copy on write is best of (IMHO) using it aggressively and thinking through carefully the desired semantics. The reason to use it aggressively is that marking things copy on write is an excellent way to save lots of work on potentially expensive operations like assigning data to variables, and also it exercises your copy on write logic to catch any bugs. Both are good. (It is too easy to accidentally ignore copy on write somewhere and you want to catch that. Just like how the local and magic mechanisms in Perl ignore each other...) The reason to think about the semantics is that for things like argument passing there can be subtle differences between pass by value, pass by reference, and passing by aliasing with copy on write. Semantically the first and third resemble each other, but with profound performance implications. The second, in the right environment, can superficially look a lot like the previous two but with gotchas to be aware of.

    For instance in comparing Ruby and Perl, a key point that is often missed is that in Ruby assignment is by reference, not value. Given that most operations in Ruby create new copies, this works out fairly naturally. But beware!

    a = "Hello"; b = a; b << ", World"; # Note: += would not do this! puts a;
    I think this is a gotcha which is worth pointing out...

    Anyways Perl has a well-established semantic model. If it was to add copy on write, it should think carefully on how to update that model...

Re: Don't have a COW? Get one!
by belg4mit (Prior) on Nov 30, 2001 at 20:36 UTC
    a) Generalize for Arrays, Hashes and objects?
    b) rename($INC{'COW.pm'}, 'Clone/Lazy.pm')?

    --
    perl -p -e "s/(?:\w);([st])/'\$1/mg"

      I guess that would be the next step. I'm keeping it with scalars for now because it's simpler that way. And this wouldn't be a module, this would be a built-in magic behavior.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Magic behavior for 5.8 or 6? If 6 then the apparent custom is to write a module that implements it for 5.6, no?

        --
        perl -p -e "s/(?:\w);([st])/'\$1/mg"

Beware the XS (was Re: Don't have a COW? Get one!)
by chip (Curate) on Dec 01, 2001 at 02:40 UTC
    Doesn't current third-party XS code depend on SvPTR() being owned only by the current SV? COW changes that....

        -- Chip Salzenberg, Free-Floating Agent of Chaos

      I wouldn't be making it third party code -- this would just be another builtin magic.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        I got that. My point is that the magic you're describing could easily break compatibility with third-party XS code.

            -- Chip Salzenberg, Free-Floating Agent of Chaos

Re: Don't have a COW? Get one!
by belg4mit (Prior) on Dec 01, 2001 at 05:40 UTC
    Something to keep in mind? If extended to arrays, doing an alias/COW on a lazy list.

    --
    perl -p -e "s/(?:\w);([st])/'\$1/mg"