PerlMonks  

Why are 5.10's named captures read only?

by blazar (Canon)
on Oct 19, 2008 at 11:15 UTC ( [id://718043]=perlmeditation )

While writing my reply to another post, I wanted to experiment with 5.10's new named captures. But the code involved modifying one of them, and perl spat out an error, which I can reproduce here with a simpler example:

spock:~ [13:01:13]$ cat guyov.pl ; ./guyov.pl
#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

$_ = "foo bar baz";
s/ (?<first> \w+)\s+(?<second> \w+) /
    $+{first} =~ y|o||d;
    "$+{first} $+{second}" /ex;
say;

__END__
Modification of a read-only value attempted at ./guyov.pl line 9.

Now, I easily understand why numbered captures should be read only, since they may be clobbered by another regexp-like operator. But from the UI POV, IMHO (sic!) if you use named captures, then you have enough control to know better: if you do want to clobber them, then it's your business, but I think it would be intuitive to use, say, $+{first} and $+{second} more or less as if they were the lexical variables $first and $second.
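For readers who hit the same error, here is a minimal sketch of one common workaround: copy the captures into real lexicals inside the /e block and modify the copies. The names $text, $first and $second are chosen for illustration only and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

my $text = "foo bar baz";

# Brace delimiters let the replacement code use tr/// with slashes.
$text =~ s{ (?<first> \w+) \s+ (?<second> \w+) }{
    my $first  = $+{first};
    my $second = $+{second};
    $first =~ tr/o//d;
    "$first $second"
}ex;

say $text;
```

The captures themselves stay untouched; only the copies are altered, so no read-only error is raised.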

Is the limitation on %+ being read-only a technical one due to implementation details, or are there more sound motivations for it to be, which I may have easily overlooked?

--
If you can't understand the incipit, then please check the IPB Campaign.

Replies are listed 'Best First'.
Re: Why are 5.10's named captures read only? (tuit)
by tye (Sage) on Oct 19, 2008 at 17:22 UTC

    I actually have a module partially written that allows one to modify numbered captures.

    One problem with changing Perl itself to allow either named or numbered captures to not be read-only, is that Perl is prone to make such captures actually point to a copy of the original string.

    FYI, allowing modification of such interleaved substrings is an interesting and rather complex problem (especially the edge cases). But I'll write that stuff up later.

    Update: Upon reading more of the thread, I see that what was wanted was to be able to modify a capture variable as a glorified temporary variable. To be clear, what my module does is allow one to modify the numbered capture such that it modifies the string variable that the regex was matched against (while ensuring all of the other numbered captures are properly made aware of the update and, if sanely possible, remain modifiable).

    - tye        

Re: Why are 5.10's named captures read only?
by JavaFan (Canon) on Oct 19, 2008 at 18:53 UTC
    First on why numbered captures are read-only. I'm not quite sure about the reason, but I think it's a performance issue. Captures aren't actually stored as copies, but as indexes into the string. This saves a lot of copying, and gives better performance.
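The "indexes into the string" idea can be observed from Perl code through the @- and @+ arrays, which hold the start and end offsets of each group. The following is only a sketch of that relationship, not a description of the internals; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "foo bar baz";
$str =~ /(\w+)\s+(\w+)/ or die "no match";

# Offsets of group 2 within $str: start, and one past the end.
my ($start, $end) = ($-[2], $+[2]);
my $second        = $2;

# Rebuilding the capture from the offsets gives the same text as $2.
my $from_offsets = substr($str, $start, $end - $start);

printf "group 2 spans offsets %d to %d: '%s'\n", $start, $end, $from_offsets;
```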

    But I'm sure about the second issue, why named captures behave the same as numbered captures when it comes to readonlyness. Named captures are stored the same way as numbered captures. The difference is that when the regexp is compiled perl makes a mapping from name to number. And then when you do $+{first} perl looks up first and sees it's mapped to 1, meaning that $+{first} is just another name for $1. This also means that if $1 is read only, so is $+{first}.
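A small sketch of what that mapping looks like from the outside: the named and numbered forms return the same text, and an attempt to modify the named form fails in the way the original post reports. The eval is there only to trap the fatal error so the script can continue.

```perl
use strict;
use warnings;
use 5.010;

"foo bar" =~ /(?<first>\w+)\s+(?<second>\w+)/ or die "no match";

# Named and numbered captures give the same values.
my $same = ($+{first} eq $1 && $+{second} eq $2) ? 1 : 0;
say "named and numbered agree: $same";

# Trying to change the named capture in place is a fatal error.
my $ok  = eval { $+{first} =~ tr/o//d; 1 };
my $err = $ok ? '' : $@;
say "error: $err";
```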

    Now, I easily understand why numbered captures should be read only, since they may be clobbered by another regexp-like operator. But from the UI POV, IMHO (sic!) if you use named captures, then you have control enough to know better: if you do want to clobber them, then it's your business
    That reasoning I don't understand. How does using named captures give you more control? If numbered captures can be clobbered, how come named captures don't?
      First on why numbered captures are read-only. I'm not quite sure about the reason, but I think it's a performance issue. Captures aren't actually stored as copies, but as indexes into the string. This saves a lot of copying, and gives better performance.

      I personally believe this pretty much explains it all, and in a reasonable and perfectly acceptable manner. I still think that one could perhaps retain performance while allowing modification of either numbered or named captures by "copying-on-modification", the meaning of which should be obvious. (But then I admit I don't have the slightest idea of the difficulties that may arise in the actual implementation, so apologies in advance to those who hack down there, should they find offensive the fact that I put it down in such simple terms...)

      That reasoning I don't understand. How does using named captures give you more control? If numbered captures can be clobbered, how come named captures don't?

      Well, I must confess that it's part of a thinko. I was reasoning mostly in the context of regexp-like operations in the substitution part of an s/// operator, which admittedly is not something you do every day. Yesterday I posted an example which would in fact be more of a counter-example. If I had a modifiable %+, then the code there would become:

      s/ ^ \b (?<head> [ \w \s \[ \] ]+ \s+ \( ) $ (?<body> .*? ^\)$ ) /
          $+{body} =~ s|QUALIFIED|| unless $+{head} ~~ m|^\w+?clk\[\d\]|;
          $+{head} . $+{body} /gemsx;

      It's clear that I also naively expect %+ not to be clobbered by the match, which would hardly be the case, since as you say named captures are nothing but other names for numbered ones! (I still think it would be a nice thing if the above could work as expected.) Indeed, from the UI POV, if %+ were retained across match operations, then you would have more control. In fact you may want to do... [/me's thinkering of some not too convoluted example...]

      doit if $x ~~ / (?<x1> \w+)\s+(?<x2> \w+) /x
          and $y ~~ / (?<y1> \w+)\s+(?<y2> \w+) /x
          and $+{x1} . $+{y2} eq $+{y1} . $+{x2};

      (Please, do not point out OWTDI!) The point here is that if I didn't have named captures, necessarily the second match's $1 and $2 would clobber the first ones'. Now, this was a thinko because both numbered captures and %+ (its implementation's details apart) must be reset or else the latter would grow indefinitely across the program...
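With %+ behaving as it does, one way to get the effect of the example above is to snapshot %+ into an ordinary hash before the second match runs. This is only a sketch: it uses =~ rather than ~~, and %first, $equal and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

my ($x, $y) = ("ab cd", "ab cd");

my ($equal, $x1_still_there) = (0, 0);
if ($x =~ / (?<x1> \w+) \s+ (?<x2> \w+) /x) {
    my %first = %+;    # plain copy taken before the next match resets %+
    if ($y =~ / (?<y1> \w+) \s+ (?<y2> \w+) /x) {
        # After the second match, %+ only knows about y1 and y2.
        $x1_still_there = defined $+{x1} ? 1 : 0;
        $equal = ($first{x1} . $+{y2} eq $+{y1} . $first{x2}) ? 1 : 0;
    }
}
say "equal: $equal, x1 still in %+: $x1_still_there";
```

The copy is an ordinary hash, so its values can also be modified freely.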


      Re the last point of the previous paragraph, one crazy idea I'm having now is that occasionally it would be nice to have that behaviour, as in the previous code example, and that it may be triggered by a lexical %+, by analogy with the new lexical $_. Thus

      {
          my %+;
          doit if $x ~~ / (?<x1> \w+)\s+(?<x2> \w+) /x
              and $y ~~ / (?<y1> \w+)\s+(?<y2> \w+) /x
              and $+{x1} . $+{y2} eq $+{y1} . $+{x2};
      }

      would do what I mean, and restore the "normal" behaviour upon exiting the lexical scope. How 'bout this idea?

      --
      If you can't understand the incipit, then please check the IPB Campaign.
Re: Why are 5.10's named captures read only?
by ambrus (Abbot) on Oct 20, 2008 at 11:39 UTC

    What I'd like to see is a way to get the positions of named captures in the string, analog to the @- and @+ variables.

      Yes, indeed.

      However, just as exposing only the string value of the named captures (as Perl 5.010 does) is convenient (for the user) but doesn't allow for the full abstraction of the feature (leaving no way to reliably find the offsets), so too would exposing the offsets not be the full abstraction and would still leave out a useful feature.

      I'd like a way to (reliably) get at the number of the (numbered) capture that matches the named capture. That would allow one to then get at the offsets for any named capture which would then allow one to get at the substring matched.

      I may end up parsing the regex myself, since I also would like to know when a capture is part of a look-ahead or look-behind. But the regex syntax has had so many enhancements added recently that parsing regexes currently looks like something that will require timely maintenance.

      - tye        

        I don't really understand how exposing the offsets and length would not be the full abstraction. Exposing the number of the numbered capture might still be a better interface even that way.

      ISTR (and would expect anyway) that in Perl 6 there's provision for all these kinda things, everything being an object, by means of suitable methods. I suppose that under Perl 5 you would expect yet another pair of special variables instead. But which ones? All of the good ones seem to be gone, and also many of the bad ones!!

      Perhaps, since AFAICT %_ is always free, it may have been chosen to hold the named captures instead of %+, and %+ and %- to hold the info you need, for analogy with @+ and @-... Before posting I was also thinking that perhaps %- were free, but it's not the case: it is... "%+ on steroids..."

      --
      If you can't understand the incipit, then please check the IPB Campaign.
        There are a gazillion variables "free" to choose from, unless you insist on one-character punctuation variables. Frankly, I don't think you need a one-character punctuation variable for this, and
        %{^MATCH_OFFSETS}
        will do fine. Personally, I'd like the values to be arrays of arrays, each inner array holding 2 elements: the index of the start of the match and the index just after the end of the match (that is, similar to @- and @+). The outer array will hold as many entries as there are captures with that name, so if you have:
        "abc" =~ /(?<l>[a-z])(?<l>[a-z])(?<l>[a-z])/
        the result is:
        %{^MATCH_OFFSETS} = ('l' => [[0, 1], [1, 2], [2, 3]]);

        I'm also pretty sure that if someone writes a patch, it will be added to Perl.
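To make the proposal concrete, here is a rough sketch using the same regex. %+ and %- are real variables; %match_offsets is a hypothetical stand-in for the proposed %{^MATCH_OFFSETS}, and the list of group numbers is hard-coded by hand for this one pattern, which is exactly the part that would need support from perl itself.

```perl
use strict;
use warnings;
use 5.010;

"abc" =~ /(?<l>[a-z])(?<l>[a-z])(?<l>[a-z])/ or die "no match";

my $leftmost = $+{l};         # %+ returns only the leftmost defined group
my @all      = @{ $-{l} };    # %- returns every group with that name

# Assumption: the name-to-number mapping is written out manually here.
my @group_numbers = (1, 2, 3);
my %match_offsets = (
    l => [ map { [ $-[$_], $+[$_] ] } @group_numbers ],
);

say "leftmost: $leftmost; all: @all";
say join ", ", map { "[$_->[0], $_->[1]]" } @{ $match_offsets{l} };
```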
