Optimizing a regex that replaces near the end of a string

perl5ever has asked for the wisdom of the Perl Monks concerning the following question:

Re: Optimizing a regex that replaces near the end of a string by shmem (Chancellor) on Mar 18, 2009 at 20:57 UTC
Perl's regex engine doesn't search from the end, but you can anchor your match at the end to make it backtrack. See perlre.

    perl -Dr -e '
        @l = ("A".."Z","a".."z",qw(< / >),0..9);
        $_ = join( "", map { $l[ rand @l ] } 1 .. 100_000 ) . "</body></html>";
        s|(</body></html>)$|zzz$1|
    '
    Compiling REx "(</body></html>)$"
    rarest char < at 0
    Final program:
       1: OPEN1 (3)
       3:   EXACT <</body></html>> (8)
       8: CLOSE1 (10)
      10: EOL (11)
      11: END (0)
    anchored "</body></html>"$ at 0 (checking anchored) minlen 14
    Omitting $` $& $' support.

    EXECUTING...

    Guessing start of match in sv for REx "(</body></html>)$" against "Fw3EK>>Y6x<q>x4s7sACk6xtpG9etod1O4uibfbVBwJNJLRurKYn>SdVRt2o"...
    Found anchored substr "</body></html>"$ at offset 100000...
    Starting position does not contradict /^/m...
    Guessed: match at offset 100000
    Matching REx "(</body></html>)$" against "</body></html>"
    100000 <yUnOt> <</body></h>    |  1:OPEN1(3)
    100000 <yUnOt> <</body></h>    |  3:EXACT <</body></html>>(8)
    100014 <body></html>> <>       |  8:CLOSE1(10)
    100014 <body></html>> <>       | 10:EOL(11)
    100014 <body></html>> <>       | 11:END(0)
    Match successful!
    Freeing REx: "(</body></html>)$"

As you see, the anchor will be checked first, and then the rest is quite inexpensive. Small benchmark...

    use Benchmark qw(cmpthese);

    @l = ("A".."Z","a".."z",qw(< / >),0..9);
    $x = join( '', map { $l[ rand @l ] } 1 .. 100_000 ) . "</body></html>";

    cmpthese ( -1, {
        noanchor => sub { local $_ = $x; s|(</body></html>)|zzz$1| },
        anchor   => sub { local $_ = $x; s|(</body></html>)$|zzz$1| },
    } );

               Rate noanchor   anchor
    noanchor 5023/s       --     -19%
    anchor   6222/s      24%       --
Re: Optimizing a regex that replaces near the end of a string by almut (Canon) on Mar 18, 2009 at 20:48 UTC
"What's a good way to write the regular expression so that perl will start searching from the end of the string?"

That's what end anchors (`$`, `\z`, `\Z`) are for... And if you only need to do a simple substring search, you can also use `rindex()`.
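A minimal sketch of how the three end anchors differ, plus a plain `rindex()` lookup. The sample string and variable names are made up for illustration; the point is only that `$` and `\Z` tolerate one trailing newline while `\z` does not.

```perl
use strict;
use warnings;

# Hypothetical sample document; note the trailing newline.
my $doc = "<p>hi</p></body></html>\n";

# $ and \Z match at the very end of the string or just before a final newline.
# \z matches only at the very end of the string.
my $dollar  = $doc =~ m{</html>$}  ? 1 : 0;
my $big_z   = $doc =~ m{</html>\Z} ? 1 : 0;
my $small_z = $doc =~ m{</html>\z} ? 1 : 0;

# rindex() searches backwards for a literal substring and returns its
# offset, or -1 if the substring does not occur.
my $at = rindex($doc, "</body>");

print "\$ => $dollar, \\Z => $big_z, \\z => $small_z, rindex => $at\n";
```

If the input may or may not end in a newline, `$` or `\Z` is the more forgiving choice; use `\z` when a trailing newline should count as a mismatch.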
Re: Optimizing a regex that replaces near the end of a string by thunders (Priest) on Mar 18, 2009 at 21:31 UTC
If your end string is really something static like that, you may not want to use a regex at all. Here's a case insensitive replacement using rindex and substr:

    use Benchmark qw(cmpthese);

    my $x = ("x" x 100_000)."</body></html>";
    my $lc_x = lc($x);
    my $end_html = qr{(</body></html>)};

    cmpthese(100_000,{
        rindex_substr => sub{
            my $str = $x;
            substr($str,rindex($lc_x,"</body>"),0,"yyyy");
        },
        regex => sub{
            my $str = $x;
            $str =~ s|$end_html$|yyyy$1|i;
        }
    });

                     Rate         regex rindex_substr
    regex         18416/s            --          -75%
    rindex_substr 73529/s          299%            --
Re: Optimizing a regex that replaces near the end of a string by thunders (Priest) on Mar 18, 2009 at 22:55 UTC
Another technique I've used with some success is simply calling reverse() on the string, running the regex, and then reversing again. You'll want to benchmark against your typical input, of course.
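A rough sketch of the reverse-match-reverse idea, assuming the marker is a literal string. The helper name `insert_before_last` is hypothetical and not from the thread. Note that the marker and the inserted text have to be reversed as well, and that "before" in the original string becomes "after" in the reversed one.

```perl
use strict;
use warnings;

# Hypothetical helper: insert $insert before the last occurrence of the
# literal $marker by working on the reversed string.
sub insert_before_last {
    my ($str, $marker, $insert) = @_;

    # In scalar context, reverse() reverses the characters of a string.
    my $rev        = reverse $str;
    my $rev_marker = reverse $marker;
    my $rev_insert = reverse $insert;

    # The first match in the reversed string is the last occurrence in the
    # original, so a plain left-to-right search finds it early.
    $rev =~ s/(\Q$rev_marker\E)/$1$rev_insert/;

    return scalar reverse $rev;
}

my $html = "<body>a</body> junk </body></html>";
print insert_before_last($html, "</body>", "zzz"), "\n";
```

The two extra copies made by `reverse` are not free on large strings, which is one reason to benchmark this against the anchored regex and the `rindex`/`substr` approach.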
Re: Optimizing a regex that replaces near the end of a string by moritz (Cardinal) on Mar 18, 2009 at 23:28 UTC
This might not apply in your specific case, but perhaps a nice idea anyway: Suppose you know an exact substring that occurs before the match. Then you can find its position with rindex, set pos to that value, and add `\G.*` in front of your regex.
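One way this could look, as a sketch under assumptions: the footer `<div>` used as the known substring is invented for the example, and the code uses a `\G`-anchored match with `/g` followed by a four-argument `substr` rather than a substitution, and a non-greedy `.*?` in place of `.*`.

```perl
use strict;
use warnings;

# Hypothetical input: a long filler, a known literal, then the end tags.
my $str   = ("x" x 50) . "<div id=\"footer\">bye</div>\n</body></html>";
my $known = '<div id="footer">';   # assumed: literal text preceding the match

# Find the last occurrence of the known literal without using the regex engine.
my $start = rindex($str, $known);

if ($start >= 0) {
    # Tell the next /g match where to begin; \G anchors the pattern there.
    pos($str) = $start;

    # /s lets . cross the newline before </body>.
    if ($str =~ /\G.*?(<\/body>\s*<\/html>)/gs) {
        # $-[1] is the offset where capture group 1 starts;
        # insert the new text just before it.
        substr($str, $-[1], 0, "zzz");
    }
}

print substr($str, -30), "\n";
```

The regex engine then only has to scan the tail of the string after `$start` instead of the whole input.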
Re: Optimizing a regex that replaces near the end of a string by JavaFan (Canon) on Mar 19, 2009 at 10:25 UTC
You should realize that 1) Perl regexes aren't very fast, and 2) the optimizer is really awesome. Perl regexes look fast because of the optimizer. If you have a string like that, and you need to replace something near `</body>`, I presume your pattern has `</body` in it. It might very well be that the optimizer spots this and does all the optimization you want already. Try running it with regex debugging enabled (if you have 5.10).
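A small sketch of turning on regex debugging from inside a script with the `re` pragma, which works without a debugging build of perl. The test string is invented; the debug trace goes to STDERR and its exact wording varies between Perl versions, so none is reproduced here.

```perl
use strict;
use warnings;

# Hypothetical input: filler followed by the closing tags.
my $html = ("x" x 1000) . "</body></html>";

{
    # Lexically scoped: only regexes compiled and run inside this block
    # print their compilation and execution trace to STDERR.
    use re 'debug';
    $html =~ s{(</body></html>)$}{zzz$1};
}

print length($html), "\n";
```

In the trace, look for the optimizer's notes about anchored or floating substrings to see whether it jumps straight to the candidate position instead of stepping through the whole string.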

