
Plug for an alternate regex engine

by bsb (Priest)
on Feb 20, 2006 at 17:03 UTC

Putter wondered on #perl6 whether there was a way to write something that can be matched just like a regex and that sets all of the $1, $2, $&, @-, $+, $^N variables correctly. I was told the obvious things didn't work, so I didn't try them :) and managed to get something close to a solution ($^N is wrong). Just to be clear, the problem is to supply a $qr that sets all of Perl's regex variables yet could be using a different regex engine underneath.
$str =~ $qr; print "$1, $2\n"; # or whatever
This code is a proof of concept and has only been tested on the single instance shown.
#!/usr/bin/perl -w
use strict;
use re 'eval';

sub showvars;

my $s = "abcdefghi";
my (@a, $pos);

print qq{"$s" =~ /(b)(.(.))/;\n};
#match($s, qr{(b)(.(.))}x ) and exit;

# This doesn't set $^N correctly and we need to know the nesting
# The rest of the regex vars should be ok if you know the number of parens
# If not then add braces up to $99, but $+, $#+ and the extra $n's will be wrong
match($s, qr/
    (?{
        $pos = 1; # pos of $& start
        # @a = your_fn($_)
        @a = ([0,3],[0,1],[1,2],[2,1]); # offset & length of captures
        # $a[0] is $&, $a[1] is $1, etc.
        # $a[1][0] is $-[1] - $-[0] and $a[1][1] is $+[1] - $-[1]
    })
    # Capture $1:
    # Wrap this in (?= ) to not bump pos
    (?=
        (??{ qr!.{$a[1][0]}! })     # Advance to start of $1
        ((??{ qr!.{$a[1][1]}! }))   # Capture the right length
    )
    # $2, $3, etc.
    (?= (??{ qr!.{$a[2][0]}! }) ((??{ qr!.{$a[2][1]}! })) )
    (?= (??{ qr!.{$a[3][0]}! }) ((??{ qr!.{$a[3][1]}! })) )

    # I think the parens are counted at regex compile time
    # so they need to be known in advance (or $+, $#+, $4 will be wrong)

    # bump pos until we are at the right spot
    (??{ (pos == $pos) ? qr{} : qr{(?!)}; })
    # capture $&
    (??{ qr!.{$a[0][1]}! })
/xs );

sub match {
    my ($s, $qr) = @_;
    $s =~ $qr or die "No match $s =~ $qr";
    showvars qw($` $& $');
    showvars qw($+ $^N);
    showvars qw($1 $2 $3 $4 $5 $6);
    showvars qw(@-);
    showvars qw(@+);
}

sub showvars {
    no warnings 'uninitialized';
    print "$_ = (",join(",",eval $_),") " for @_;
    print "\n";
}

# $+  text of last successful match
# $^N similar, but of last rightmost closing paren
# @+  array of end pos, $+[0] is whole, $#+ is last good
The output is:
"abcdefghi" =~ /(b)(.(.))/;
$` = (a) $& = (bcd) $' = (efghi)
$+ = (d) $^N = (d)
$1 = (b) $2 = (cd) $3 = (d) $4 = () $5 = () $6 = ()
@- = (1,1,2,3)
@+ = (4,2,4,4)
Brad
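
For comparison, here is a small sketch that runs the same test case through Perl's built-in engine and records the values an emulating $qr would have to reproduce. The %native hash and its key names are only there for illustration. The one difference from the output in the post is $^N, which the built-in engine sets to the most recently closed group ($2 here).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reference run with the built-in engine on the test case from the post.
my $s = "abcdefghi";
my %native;
if ($s =~ /(b)(.(.))/) {
    %native = (
        pre    => $`,
        match  => $&,
        post   => $',
        plus   => $+,             # highest-numbered group that matched ($3)
        caretN => $^N,            # most recently closed group ($2)
        starts => join(",", @-),  # start offsets, $-[0] is the whole match
        ends   => join(",", @+),  # end offsets, $+[0] is the whole match
    );
}
print "$_ = ($native{$_})\n" for sort keys %native;
```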

Re: Plug for an alternate regex engine
by hv (Prior) on Feb 21, 2006 at 10:46 UTC

    It is certainly possible to write a function that accepts information about the values you want to set for the variables and returns a regexp that sets them up that way.

    As you supposed, the parens are counted at regexp compile time, so it is not possible to embed all the logic in a regexp without fixing the paren count in advance.
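
A quick way to see the compile-time paren count in action, as a sketch: groups written in the outer pattern are counted in $#+, while groups inside a pattern returned from (??{ }) keep their own numbering and never reach the outer $1. The variable names below are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Groups written literally in the pattern are counted when it is compiled.
my ($literal_count, $literal_1);
if ("abcd" =~ /a(b)(c)/) {
    ($literal_count, $literal_1) = ($#+, $1);
}

# Groups that only exist inside a deferred (??{ }) sub-pattern are not
# visible to the outer pattern, so the outer paren count stays at zero.
my ($deferred_count, $deferred_1, $deferred_match);
if ("abcd" =~ /a(??{ '(b)(c)' })/) {
    ($deferred_count, $deferred_1, $deferred_match) = ($#+, $1, $&);
}

printf "literal:  groups=%d \$1=%s\n", $literal_count, $literal_1;
printf "deferred: groups=%d \$1=%s \$&=%s\n",
    $deferred_count, defined $deferred_1 ? $deferred_1 : 'undef', $deferred_match;
```

This is why the proof of concept wraps each (??{ }) in a literal pair of parens: the capturing parens have to be part of the outer pattern.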

    $^N will be the first match capture that reaches the maximum value of @+[1..], which can be emulated by constructing the regexp to nest the captures that end at this point.
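
The rule above can be checked against the built-in engine on a few simple patterns. This is only a sketch: predict_caret_n is a hypothetical helper, it works on saved copies of @- and @+, and the three cases contain no lookarounds or unset groups, so agreement here says nothing about those.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max first);

# Hypothetical helper: pick the first group whose end offset equals the
# largest end offset among groups 1..n, and return the text it covers.
sub predict_caret_n {
    my ($str, $starts, $ends) = @_;
    my @set = grep { defined $ends->[$_] } 1 .. $#$ends;
    return unless @set;
    my $max = max map { $ends->[$_] } @set;
    my $i   = first { $ends->[$_] == $max } @set;
    return substr $str, $starts->[$i], $ends->[$i] - $starts->[$i];
}

my @report;
for my $case (
    [ "abcdefghi", qr/(b)(.(.))/ ],
    [ "abcd",      qr/(a)(b)(c)/ ],
    [ "abcd",      qr/(a(b(c)))/ ],
) {
    my ($str, $re) = @$case;
    next unless $str =~ $re;
    # Copy the match variables before anything else can clobber them.
    my $actual = $^N;
    my @starts = @-;
    my @ends   = @+;
    my $guess  = predict_caret_n($str, \@starts, \@ends);
    push @report, [ "$re", $guess, $actual ];
}
printf "%-20s rule=%-4s \$^N=%s\n", @$_ for @report;
```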

    You may need to allow for some captures being unset, as in "ac" =~ /(a)?(b)?(c)?/.
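
To make the unset case concrete, here is what the built-in engine reports for that example (a minimal sketch; the variable names are mine). The group that did not take part in the match leaves $2, $-[2] and $+[2] undefined, while $#+ still counts all three groups.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "ac";
$str =~ /(a)?(b)?(c)?/ or die "no match";

# Save everything straight away.
my @groups      = ($1, $2, $3);
my @starts      = @-;
my @ends        = @+;
my $group_count = $#+;    # number of groups in the pattern

for my $n (1 .. $group_count) {
    if (defined $groups[$n - 1]) {
        print "\$$n = '$groups[$n - 1]' at $starts[$n]..$ends[$n]\n";
    } else {
        print "\$$n is unset\n";
    }
}
```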

    For the simple case where the nesting is natural, the most efficient approach would be to forget the lookaheads and just construct a nesting of dots and parens, along with a simple negative lookahead for unset parens: ((?!))?. I think something like the below would do it, but I have not tested it exhaustively:

    my $s = "abcdefghi";
    my @a = ([ 0, 3 ], [ 0, 1 ], [ 1, 2 ], [ 2, 1 ]);
    my $qr = matcher(\@a);
    match($s, $qr);

    sub matcher {
        my $array = shift;
        my(%pos, @undef);
        my($start, $length) = @{ $array->[0] };
        for (1 .. $#$array) {
            my($pos, $width) = ($array->[$_][0], $array->[$_][1]);
            if (defined $pos) {
                $pos{$pos} .= '(';
                $pos{$pos + $width} .= ')';
            } else {
                push @undef, $_;
            }
        }
        my $qr = '.' x ($start + $length);
        for (sort { $b <=> $a } keys %pos) {
            substr $qr, $_, 0, $pos{$_};
        }
        for (reverse @undef) {
            $qr =~ s/((.*?\(){$_})/$1((?!))?/;
        }
        $qr =~ s{(.{$start})}{(?<=^$1)} if $start;
        qr/$qr/;
    }

    Extending this to add lookaheads for captures that are not naturally nested is left as an exercise for the reader. :)
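
The ((?!))? placeholder for unset parens can also be tried on its own. The sketch below is a separate, simplified builder, not the matcher above: build_pattern is a hypothetical name, offsets are relative to the start of the match, groups are assumed to be non-empty and naturally nested, and unset groups are given as [undef, undef]. Each placeholder is emitted just before the opening paren of the next set group so that the group numbering comes out right.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical builder: $spans->[0] is [0, length of the whole match],
# $spans->[n] is [offset, length] of group n, or [undef, undef] if unset.
sub build_pattern {
    my ($spans) = @_;
    my $len = $spans->[0][1];
    my (%open, %close);
    my $pending = '';                 # placeholders waiting for a position
    for my $n (1 .. $#$spans) {
        my ($off, $width) = @{ $spans->[$n] };
        if (!defined $off) {
            $pending .= '((?!))?';    # group that can never take part
            next;
        }
        $open{$off}           .= $pending . '(';
        $close{$off + $width} .= ')';
        $pending = '';
    }
    my $pat = '';
    for my $i (0 .. $len) {
        $pat .= $close{$i} // '';     # close groups ending here first
        $pat .= $open{$i}  // '';     # then open groups starting here
        $pat .= '.' if $i < $len;
    }
    return $pat . $pending;
}

# "ac" =~ /(a)?(b)?(c)?/ leaves $2 unset.
my $pat = build_pattern([ [0, 2], [0, 1], [undef, undef], [1, 1] ]);
my @got;
if ("ac" =~ /^$pat/s) {
    @got = ($1, $2, $3);
}

# The nested case from the original post, anchored one character in.
my $nested = build_pattern([ [0, 3], [0, 1], [1, 2], [2, 1] ]);
my $start  = 1;
my ($amp, $caret_n);
if ("abcdefghi" =~ /(?<=^.{$start})$nested/s) {
    ($amp, $caret_n) = ($&, $^N);
}

print "pattern: $pat\n";
print "nested:  $nested\n";
```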

    With respect to your title, note that it is possible to plug in an alternate regexp engine - this is how use re 'debug' is implemented - but I'm not aware that anyone has ever taken advantage of this, nor do I imagine there is any documentation of how you might do so.

    Hugo

Re: Plug for an alternate regex engine
by bsb (Priest) on Feb 21, 2006 at 18:51 UTC
    As suggested by hv, I got the $^N working. Again you need to know the nesting in advance. This is the $2 & $3 bit:
    (?=
        (??{ qr!.{$a[2][0]}! })
        (   # start $2
            (?= (??{ qr!.{${\($a[3][0]-$a[2][0])}}! })
                ((??{ qr!.{$a[3][1]}! })) )   # $3 nested
            (??{ qr!.{$a[2][1]}! })
        )   # $2 ended
    )
