I've got a new regex trick up my sleeve, and although I'm proposing a patch for it, I'm not sure it'll be accepted. It's a new pseudo-anchor, \K, which tells the engine to pretend it just started matching. That's not a very good explanation, so let me use an example:
# code to remove '.z' from $str
# old way (slow)
$str = join '.', 'a' .. 'z';
$str =~ s/(.*)\..*/$1/;
# new way (fast)
$str = join '.', 'a' .. 'z';
$str =~ s/.*\K\..*//;
I'll post more soon, when I have a patch and what-not, but I hope this gives you a good idea of what this anchor does.
Another way of looking at it is it lets you have a variable-width look-behind at the beginning of your regex... it allows your regex to have a prelude, as it were.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: New regex trick...
by erikharrison (Deacon) on Jul 22, 2002 at 19:33 UTC
|
I understand (I think) why this is fast. What I don't understand is how it works. I'll admit that I am not much of regex hacker, but how does the variable length look behind now how far to look, in the case of quantifiers, especially greedy ones? It seems that the goal of the anchor is to prevent backtracking (hence the speed gains) but how does it now when to stop? If the \K anchor tells the regex engine to watch for the oncoming regex and stop matching there, whats the difference between \K and a minimal match, or negated character class? If the engine doesn't stop at the first instance of, say '.' (as in your example) how do you keep it from backtracking (which is where I'm presuming the speed gains come from)? Or is \K an optimization using sexegers?
Now, japhy I have absolute faith in your pattern - foo, so I take it on faith that this works. What I'm curious about is how. Am I missing something in thinking that the performance gains here are related to backtracking?
Cheers,
Erik
Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet
| [reply] |
|
You're thinking too hard. I cheated. My patch just fools the regex engine into thinking it hasn't actually started matching yet. Here's a drawn-out example.
$str = "abc.def.ghi.jkl";
$str =~ s{
.* # match as much as you can
\K # and then pretend HERE is where we start
\. .* # then a . and anything else
}{}x; # replace with nothing
__END__
abc.def.ghi.jkl $&
AAAAAAAAAAA .* "abc.def.ghi"
\K ""
B \. "."
BCCC .* ".jkl"
Does that help you see what I do? My patch consists of a couple lines of support, but this is the beef:
case KEEP:
PL_regstartp[0] = locinput - PL_bostr;
break;
That's what happens when the regex engine encounters the \K. The rest of the patch is just creating the "KEEP" node, and telling toke.c that "\K" is a valid escape sequence.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] [d/l] [select] |
|
Okay, I know it's because I'm dumb, but I still don't get it. Please don't yell at me, but why does the \K anchor keep .* from matching .jkl? And if it backtracks like normal, then where does the speed come from? I think that may be the essence of my confusion - why is this faster?
If I don't get it this time, I'll give up and just trust it ;-).
Cheers,
Erik
Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet
| [reply] |
|
|
|
Re: New regex trick...
by broquaint (Abbot) on Jul 22, 2002 at 19:34 UTC
|
So \K basically means don't worry, I only started matching *here*, where here is the position where the previous atom of the regex stopped matching e.g
my $str = "foo bar baz";
# will 'start' matching here ^
$str =~ s< (?: \w+ [ ] )+ \K \w+ >()x;
print '[' . $str . ']';
__output__
[foo bar ]
Or should I just go back to implementing pointers in perl?
HTH
_________ broquaint
update: added missing whitespace in output (odd how an additional character requires so many to explain ;) | [reply] [d/l] |
|
You've got the right idea (except for a missing whitespace in your output).
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
| [reply] |
Re: New regex trick...
by japhy (Canon) on Jul 22, 2002 at 20:18 UTC
|
If this patch doesn't go through (and I can see reasons why it wouldn't, but I'd like it to go through), the same functionality can be achieved by my new module, Regexp::Keep. It's not on CPAN yet, but it's an XS module that does two things:
- filters regexes for the \K escape sequence, and turns them into (?{Regexp::Keep::KEEP})...
- and defines an XS function, KEEP(), that does what \K is supposed to do
Of course, constant-overloading is nasty, so I hope my patch is received well.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] |
|
Can you post some benchmarks for the module? Or are the ones you posted above from the module (I'm assuming not).
| [reply] |
|
use Benchmark 'cmpthese';
use Regexp::Keep;
my $s = "abc.def.ghi.jkl";
cmpthese(-5, {
japhy => sub { (my $x = $s) =~ s/.*\K\..*// },
old => sub { (my $x = $s) =~ s/(.*)\..*/$1/ },
});
In 5.6.1:
old = 40280.54/s
japhy = 74338.79/s
And with 5.8.0 (using the module, NOT my patch):
old = 58188.18/s
japhy = 102409.25/s
So there's an appreciable speed-up.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] [d/l] [select] |
Re: New regex trick...
by FoxtrotUniform (Prior) on Jul 22, 2002 at 19:18 UTC
|
What's the mnemonic for \K?
Update: Ah, got it.
This is sort of like Prolog's "cut", isn't it? Match
a prefix, then never backtrack back past the \K anchor to
satisfy the regex. Can you have multiple \Ks in a single
regex? I can see Abigail-II having lots of fun with
this.
--
The hell with paco, vote for Erudil!
:wq
| [reply] [d/l] |
|
Yes, you can have multiple \K's in a regex, although only the last one really means anything, should the regex succeed. I'm actually curious why you would need more than one in the same branch, but it's perfectly legal.
But no, it's not cut. Cut is (?>pat). This is just telling the regex engine to pretend it just started matching, that $& is defined starting HERE.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
| [reply] [d/l] |
|
There are two. K = Keep, as in, "keep everything we've just matched", because I envision it being used in substitutions. The other mnemonic is "\K isn't used by anything yet". ;)
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
| [reply] |
Re: New regex trick...
by vladb (Vicar) on Jul 22, 2002 at 19:23 UTC
|
So, to simply remove '.z' you go in all the trouble of doing the '.*' (glob all) match etc etc. Whereas a shorter version must be faster:
$str =~ s/\..$//;
Unless I'm missing some other deeply nested point {griN}
_____________________
# Under Construction
| [reply] [d/l] [select] |
|
Yours is a very special case, because we know there's only one character after the last ".". Here is a benchmark I ran with my new patch:
use Benchmark 'cmpthese';
my $str = join '.', 'a' .. 'z';
cmpthese(-5, {
old => sub { (my $x = $str) =~ s/(.*)\..*/$1/ },
new => sub { (my $x = $str) =~ s/.*\K\..*// },
});
__END__
new: 109650.19/s (n=578953)
old: 45258.35/s (n=241227)
... which is an incredible speed increase!
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] [d/l] |
Re: New regex trick...
by dws (Chancellor) on Jul 22, 2002 at 20:06 UTC
|
It's a new pseudo-anchor, \K, which tells the engine to pretend it just started matching.
What's the impact on pos() if you fake out the starting point? Does pos() still reflect the start of matching, or does it reflect the start of the last \K?
| [reply] [d/l] [select] |
|
Present an example that explains your question. pos() refers to the end of the last global match. Are you asking about any interference between \K and \G? I do not foresee any.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
| [reply] |
|
| [reply] [d/l] |
Re: New regex trick...
by dimmesdale (Friar) on Jul 24, 2002 at 21:45 UTC
|
If you're looking for input, then I'd have to say that that's a wonderful idea. It's odd, because I was just wishing there was something like this over the past week -- for one, I think it makes the substitutions look a little bit cleaner as to their true intent (but maybe that's just me).
Look forward to it. | [reply] |
|
|