Plug for an alternate regex engine

Putter wondered on #perl6 whether there was a way to write something that can be matched just like a regex and that sets all of the $1, $2, $&, @-, $+, $^N variables correctly. I was told the obvious things didn't work, so I didn't try them :) and managed to get something close to a solution ($^N is wrong). Just to be clear, the problem is to supply a $qr that sets all of Perl's regex variables even though the matching itself could be done by a different regex engine.

 $str =~ $qr;
 print "$1, $2\n"; # or whatever

This code is proof of concept and has only been tested on the single instance shown.

#!/usr/bin/perl -w

use strict;
use re 'eval';
sub showvars ;

my $s = "abcdefghi";
my (@a, $pos);

print qq{"$s" =~ /(b)(.(.))/;\n};
#match($s, qr{(b)(.(.))}x ) and exit;

# This doesn't set $^N correctly and we need to know the nesting
# The rest of the regex vars should be ok if you know the number of parens
# If not then add braces up to $99, but $+, $#+ and the extra $n's will be wrong
match($s, qr/
    (?{ $pos = 1;   # pos of $& start

        #  @a = your_fn($_)
        @a = ([0,3],[0,1],[1,2],[2,1]);
                    # offset & length of captures
                    # $a[0] is $&, $a[1] is $1, etc.
                    # offsets are relative to the start of $&, so
                    # $a[1][0] is $-[1] - $-[0] and $a[1][1] is $+[1] - $-[1]
      })

    # Capture $1:
    # Wrap this in (?= ) to not bump pos
    (?= 
        (??{ qr!.{$a[1][0]}! })     # Advance to start of $1
        ((??{ qr!.{$a[1][1]}! }))   # Capture the right length
    )
    # $2, $3, etc.
    (?= (??{ qr!.{$a[2][0]}! }) ((??{ qr!.{$a[2][1]}! })) )
    (?= (??{ qr!.{$a[3][0]}! }) ((??{ qr!.{$a[3][1]}! })) )
    # I think the parens are counted at regex compile time
    # so they need to be known in advance (or $+, $#+, $4 will be wrong)

    # bump pos until we're at the right spot
    (??{ (pos == $pos) ? qr{} : qr{(?!)}; })

    # capture $&
    (??{ qr!.{$a[0][1]}! })
/xs );

sub match {
    my ($s, $qr) = @_;
    $s =~ $qr or die "No match $s =~ $qr";

    showvars qw($` $& $');
    showvars qw($+ $^N);
    showvars qw($1 $2 $3 $4 $5 $6);
    showvars qw(@-);
    showvars qw(@+);
}

sub showvars {
    no warnings 'uninitialized';
    print "$_ = (",join(",",eval $_),") " for @_;
    print "\n";
}

# $+ text of the highest-numbered capture group that matched
# $^N similar, but of the most recently closed group (rightmost closing paren)
# @+ array of end positions, $+[0] is the whole match, $#+ is the number of groups

The output is:

"abcdefghi" =~ /(b)(.(.))/;
$` = (a) $& = (bcd) $' = (efghi) 
$+ = (d) $^N = (d) 
$1 = (b) $2 = (cd) $3 = (d) $4 = () $5 = () $6 = () 
@- = (1,1,2,3) 
@+ = (4,2,4,4)
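For comparison, here is a minimal sketch of what Perl's own engine reports for the same string and the pattern the emulation is imitating. The variable names (%native and so on) are only for illustration. It shows the one value the emulation gets wrong: the native $^N is the text of the most recently closed group, which is $2 here, not $3.

```perl
use strict;
use warnings;

my $s = "abcdefghi";
$s =~ /(b)(.(.))/ or die "no match";

# Copy everything right away; a later successful match would overwrite it.
my %native = (
    whole   => $&,          # "bcd"
    g1      => $1,          # "b"
    g2      => $2,          # "cd"
    g3      => $3,          # "d"
    plus    => $+,          # "d":  highest-numbered group that matched
    caret_n => $^N,         # "cd": group whose closing paren was passed last
    starts  => join(",", @-),   # "1,1,2,3"
    ends    => join(",", @+),   # "4,2,4,4"
);

print "$_ = ($native{$_})\n" for sort keys %native;
```

Everything except caret_n agrees with the output of the emulation above.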

Brad

Re: Plug for an alternate regex engine by hv (Prior) on Feb 21, 2006 at 10:46 UTC
It is certainly possible to write a function that accepts information about the values you want to set for the variables, that returns a regexp to set them up in that way.

As you supposed, the parens are counted at regexp compile time, so it is not possible to embed all the logic in a regexp without fixing the paren count in advance.

`$^N` will be the first capture that reaches the maximum value of `@+[1..]`, which can be emulated by constructing the regexp to nest the captures that end at this point.

You may need to allow for some captures being unset, as in `"ac" =~ /(a)?(b)?(c)?/`.

For the simple case where the nesting is natural, most efficient would be to forget the lookaheads and just construct a nesting of dots and parens, along with a simple negative lookahead for unset parens `((?!))?`. I think something like the below would do it, but I have not tested it exhaustively:

my $s = "abcdefghi";
my @a = ([ 0, 3 ], [ 0, 1 ], [ 1, 2 ], [ 2, 1 ]);
my $qr = matcher(\@a);
match($s, $qr);

sub matcher {
    my $array = shift;
    my(%pos, @undef);
    my($start, $length) = @{ $array->[0] };
    for (1 .. $#$array) {
        my($pos, $width) = ($array->[$_][0], $array->[$_][1]);
        if (defined $pos) {
            $pos{$pos} .= '(';
            $pos{$pos + $width} .= ')';
        } else {
            push @undef, $_;
        }
    }
    my $qr = '.' x ($start + $length);
    for (sort { $b <=> $a } keys %pos) {
        substr $qr, $_, 0, $pos{$_};
    }
    for (reverse @undef) {
        $qr =~ s/((.*?\(){$_})/$1((?!))?/;
    }
    $qr =~ s{(.{$start})}{(?<=^$1)} if $start;
    qr/$qr/;
}

Extending this to add lookaheads for captures that are not naturally nested is left as an exercise for the reader. :)

With respect to your title, note that it is possible to plug in an alternate regexp engine - this is how `use re 'debug'` is implemented - but I'm not aware that anyone has ever taken advantage of this, nor do I imagine there is any documentation of how you might do so.

Hugo
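To make the point about unset captures concrete, here is a small sketch of what the native engine reports for the `"ac"` example, followed by the `((?!))?` idiom on its own. The helper `show` and the array names are hypothetical, added only for this illustration. An unset group shows up as undef in both `@-` and `@+`, and a group that can never match, made optional, reproduces exactly that state.

```perl
use strict;
use warnings;

# Render a list with undef shown explicitly.
sub show { join ",", map { defined $_ ? $_ : "undef" } @_ }

"ac" =~ /(a)?(b)?(c)?/ or die "no match";
my @starts = @-;    # (0, 0, undef, 1)
my @ends   = @+;    # (2, 1, undef, 2)
print "starts = (", show(@starts), ")\n";
print "ends   = (", show(@ends),   ")\n";

# ((?!))? is a capture group that always fails but is optional,
# so it takes up a group number and stays unset.
"abc" =~ /(a)((?!))?(b)/ or die "no match";
my @groups = ($1, $2, $3);  # ("a", undef, "b")
print "groups = (", show(@groups), ")\n";
```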
Re: Plug for an alternate regex engine by bsb (Priest) on Feb 21, 2006 at 18:51 UTC
As suggested by hv, I got the $^N working. Again you need to know the nesting in advance. This is the $2 & $3 bit:

(?= (??{ qr!.{$a[2][0]}! })
    (                                   # start $2
        (?= (??{ qr!.{${\($a[3][0]-$a[2][0])}}! })
            ((??{ qr!.{$a[3][1]}! })) ) # $3 nested
        (??{ qr!.{$a[2][1]}! })
    )                                   # $2 ended
)
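Putting the nested $2/$3 fragment back into the original pattern gives roughly the following. This is a sketch under assumptions, not the code as posted: it returns plain strings from the `(??{ })` blocks instead of `qr!!` objects, uses `(?:)` for the "always matches" case, and assumes Perl 5.18 or later, where code blocks in a literal pattern are compiled as ordinary closures over `@a` and `$pos`.

```perl
use strict;
use warnings;

my $s = "abcdefghi";
my (@a, $pos);

my $qr = qr/
    (?{ $pos = 1; @a = ([0,3],[0,1],[1,2],[2,1]); })

    # group 1, inside a lookahead so the position is not advanced
    (?= (??{ ".{$a[1][0]}" }) ((??{ ".{$a[1][1]}" })) )

    # group 2 with group 3 nested inside it, so group 2 closes last
    (?= (??{ ".{$a[2][0]}" })
        (
            (?= (??{ ".{" . ($a[3][0] - $a[2][0]) . "}" })
                ((??{ ".{$a[3][1]}" })) )
            (??{ ".{$a[2][1]}" })
        )
    )

    # fail at every start position except the wanted one
    (??{ pos() == $pos ? '(?:)' : '(?!)' })

    # consume the whole match
    (??{ ".{$a[0][1]}" })
/xs;

$s =~ $qr or die "no match";

# Copy the match variables before anything else can reset them.
my ($whole, $g1, $g2, $g3, $plus, $caret_n) = ($&, $1, $2, $3, $+, $^N);
print "whole=$whole g1=$g1 g2=$g2 g3=$g3 plus=$plus caret_n=$caret_n\n";
```

Because group 3 now closes before group 2 does, $^N should report the text of $2, matching what the native pattern `/(b)(.(.))/` gives.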

