New regex trick...

Replies are listed 'Best First'.
Re: New regex trick... by erikharrison (Deacon) on Jul 22, 2002 at 19:33 UTC
I understand (I think) why this is fast. What I don't understand is how it works. I'll admit that I am not much of regex hacker, but how does the variable length look behind now how far to look, in the case of quantifiers, especially greedy ones? It seems that the goal of the anchor is to prevent backtracking (hence the speed gains) but how does it now when to stop? If the \K anchor tells the regex engine to watch for the oncoming regex and stop matching there, whats the difference between \K and a minimal match, or negated character class? If the engine doesn't stop at the first instance of, say '.' (as in your example) how do you keep it from backtracking (which is where I'm presuming the speed gains come from)? Or is \K an optimization using sexegers? Now, japhy I have absolute faith in your pattern - foo, so I take it on faith that this works. What I'm curious about is how. Am I missing something in thinking that the performance gains here are related to backtracking? Cheers, Erik Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet	[reply]
Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 19:56 UTC
You're thinking too hard. I cheated. My patch just fools the regex engine into thinking it hasn't actually started matching yet. Here's a drawn-out example. `$str = "abc.def.ghi.jkl"; $str =~ s{ .* # match as much as you can \K # and then pretend HERE is where we start \. .* # then a . and anything else }{}x; # replace with nothing __END__ abc.def.ghi.jkl $& AAAAAAAAAAA .* "abc.def.ghi" \K "" B \. "." BCCC .* ".jkl"` [download] Does that help you see what I do? My patch consists of a couple lines of support, but this is the beef: `case KEEP: PL_regstartp[0] = locinput - PL_bostr; break;` [download] That's what happens when the regex engine encounters the `\K`. The rest of the patch is just creating the "KEEP" node, and telling toke.c that "\K" is a valid escape sequence. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l] [select]
Re: Re: Re: New regex trick... by erikharrison (Deacon) on Jul 22, 2002 at 20:34 UTC
Okay, I know it's because I'm dumb, but I still don't get it. Please don't yell at me, but why does the \K anchor keep .* from matching .jkl? And if it backtracks like normal, then where does the speed come from? I think that may be the essence of my confusion - why is this faster? If I don't get it this time, I'll give up and just trust it ;-). Cheers, Erik Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet	[reply]
Re: Re: Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 20:49 UTC
Re: Re: Re: Re: Re: New regex trick... by redsquirrel (Hermit) on Jul 22, 2002 at 21:08 UTC
Re: Re: Re: Re: New regex trick... by shelob101 (Sexton) on Jul 23, 2002 at 20:11 UTC
Re: New regex trick... by broquaint (Abbot) on Jul 22, 2002 at 19:34 UTC
So `\K` basically means don't worry, I only started matching here, where here is the position where the previous atom of the regex stopped matching e.g `my $str = "foo bar baz"; # will 'start' matching here ^ $str =~ s< (?: \w+ [ ] )+ \K \w+ >()x; print '[' . $str . ']'; __output__ [foo bar ]` [download] Or should I just go back to implementing pointers in perl? HTH `_________ broquaint` update: added missing whitespace in output (odd how an additional character requires so many to explain ;)	[reply] [d/l]
Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 19:50 UTC
You've got the right idea (except for a missing whitespace in your output). _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 20:18 UTC
If this patch doesn't go through (and I can see reasons why it wouldn't, but I'd like it to go through), the same functionality can be achieved by my new module, Regexp::Keep. It's not on CPAN yet, but it's an XS module that does two things: filters regexes for the `\K` escape sequence, and turns them into `(?{Regexp::Keep::KEEP})`... and defines an XS function, `KEEP()`, that does what `\K` is supposed to do Of course, constant-overloading is nasty, so I hope my patch is received well. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re: Re: New regex trick... by mischief (Hermit) on Jul 24, 2002 at 13:00 UTC
Can you post some benchmarks for the module? Or are the ones you posted above from the module (I'm assuming not).	[reply]
Regexp::Keep benchmark by japhy (Canon) on Jul 24, 2002 at 14:35 UTC
Here's the code: `use Benchmark 'cmpthese'; use Regexp::Keep; my $s = "abc.def.ghi.jkl"; cmpthese(-5, { japhy => sub { (my $x = $s) =~ s/.\K\..// }, old => sub { (my $x = $s) =~ s/(.)\../$1/ }, });` [download] In 5.6.1: `old = 40280.54/s japhy = 74338.79/s` [download] And with 5.8.0 (using the module, NOT my patch): `old = 58188.18/s japhy = 102409.25/s` [download] So there's an appreciable speed-up. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l] [select]
Re: New regex trick... by FoxtrotUniform (Prior) on Jul 22, 2002 at 19:18 UTC
What's the mnemonic for `\K`? Update: Ah, got it. This is sort of like Prolog's "cut", isn't it? Match a prefix, then never backtrack back past the \K anchor to satisfy the regex. Can you have multiple \Ks in a single regex? I can see Abigail-II having lots of fun with this. `-- The hell with paco, vote for Erudil! :wq`	[reply] [d/l]
Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 19:40 UTC
Yes, you can have multiple `\K`'s in a regex, although only the last one really means anything, should the regex succeed. I'm actually curious why you would need more than one in the same branch, but it's perfectly legal. But no, it's not cut. Cut is `(?>pat)`. This is just telling the regex engine to pretend it just started matching, that `$&` is defined starting HERE. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 19:22 UTC
There are two. K = Keep, as in, "keep everything we've just matched", because I envision it being used in substitutions. The other mnemonic is "\K isn't used by anything yet". `;)` _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re: New regex trick... by vladb (Vicar) on Jul 22, 2002 at 19:23 UTC
So, to simply remove '.z' you go in all the trouble of doing the '.*' (glob all) match etc etc. Whereas a shorter version must be faster: `$str =~ s/\..$//;` [download] Unless I'm missing some other deeply nested point {griN} _____________________ `# Under Construction` [download]	[reply] [d/l] [select]
Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 19:33 UTC
Yours is a very special case, because we know there's only one character after the last ".". Here is a benchmark I ran with my new patch: `use Benchmark 'cmpthese'; my $str = join '.', 'a' .. 'z'; cmpthese(-5, { old => sub { (my $x = $str) =~ s/(.)\../$1/ }, new => sub { (my $x = $str) =~ s/.\K\..// }, }); __END__ new: 109650.19/s (n=578953) old: 45258.35/s (n=241227)` [download] ... which is an incredible speed increase! _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re: New regex trick... by dws (Chancellor) on Jul 22, 2002 at 20:06 UTC
It's a new pseudo-anchor, \K, which tells the engine to pretend it just started matching. What's the impact on `pos()` if you fake out the starting point? Does `pos()` still reflect the start of matching, or does it reflect the start of the last \K?	[reply] [d/l] [select]
Re: Re: New regex trick... by japhy (Canon) on Jul 22, 2002 at 20:12 UTC
Present an example that explains your question. `pos()` refers to the end of the last global match. Are you asking about any interference between `\K` and `\G`? I do not foresee any. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re: Re: Re: New regex trick... by dws (Chancellor) on Jul 22, 2002 at 20:24 UTC
pos() refers to the end of the last global match. Oops, you're right. The code fragement I was thinking of used `pos() - $delta` to get the starting point.	[reply] [d/l]
Re: New regex trick... by dimmesdale (Friar) on Jul 24, 2002 at 21:45 UTC
If you're looking for input, then I'd have to say that that's a wonderful idea. It's odd, because I was just wishing there was something like this over the past week -- for one, I think it makes the substitutions look a little bit cleaner as to their true intent (but maybe that's just me). Look forward to it.	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks