PerlMonks  

Optimizing a regex that replaces near the end of a string

by perl5ever (Pilgrim)
on Mar 18, 2009 at 20:14 UTC ( [id://751567]=perlquestion )

perl5ever has asked for the wisdom of the Perl Monks concerning the following question:

Suppose you have:
my $x = ("x" x 100_000)."</body></html>";
and you want to put something just before the </body> tag. What's a good way to write the regular expression so that perl will start searching from the end of the string?

Replies are listed 'Best First'.
Re: Optimizing a regex that replaces near the end of a string
by shmem (Chancellor) on Mar 18, 2009 at 20:57 UTC

    Perl's regex engine doesn't search from the end, but you can anchor your match at the end so that the optimizer checks the tail of the string first. See perlre.

    perl -Dr -e '
    @l = ("A".."Z","a".."z",qw(< / >),0..9);
    $_ = join( "", map { $l[ rand @l ] } 1 .. 100_000 ) . "</body></html>";
    s|(</body></html>)$|zzz$1|
    '
    Compiling REx "(</body></html>)$"
    rarest char < at 0
    Final program:
       1: OPEN1 (3)
       3:   EXACT <</body></html>> (8)
       8: CLOSE1 (10)
      10: EOL (11)
      11: END (0)
    anchored "</body></html>"$ at 0 (checking anchored) minlen 14
    Omitting $` $& $' support.

    EXECUTING...

    Guessing start of match in sv for REx "(</body></html>)$" against "Fw3EK>>Y6x<q>x4s7sACk6xtpG9etod1O4uibfbVBwJNJLRurKYn>SdVRt2o"...
    Found anchored substr "</body></html>"$ at offset 100000...
    Starting position does not contradict /^/m...
    Guessed: match at offset 100000
    Matching REx "(</body></html>)$" against "</body></html>"
    100000 <yUnOt> <</body></h>  |  1:OPEN1(3)
    100000 <yUnOt> <</body></h>  |  3:EXACT <</body></html>>(8)
    100014 <body></html>> <>     |  8:CLOSE1(10)
    100014 <body></html>> <>     | 10:EOL(11)
    100014 <body></html>> <>     | 11:END(0)
    Match successful!
    Freeing REx: "(</body></html>)$"

    As you see, the anchor will be checked first, and then the rest is quite inexpensive. Small benchmark...

    use Benchmark qw(cmpthese);
    @l = ("A".."Z","a".."z",qw(< / >),0..9);
    $x = join( '', map { $l[ rand @l ] } 1 .. 100_000 ) . "</body></html>";
    cmpthese ( -1, {
        noanchor => sub { local $_ = $x; s|(</body></html>)|zzz$1| },
        anchor   => sub { local $_ = $x; s|(</body></html>)$|zzz$1| },
    } );

               Rate noanchor   anchor
    noanchor 5023/s       --     -19%
    anchor   6222/s      24%       --

Re: Optimizing a regex that replaces near the end of a string
by almut (Canon) on Mar 18, 2009 at 20:48 UTC
    What's a good way to write the regular expression so that perl will start searching from the end of the string?

    That's what end anchors ($, \z, \Z) are for... And if you only need to do a simple substring search, you can also use rindex().
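
The three end anchors differ only in how they treat a trailing newline. A minimal sketch, assuming two made-up inputs (the same tail with and without a final "\n"); the hash and variable names are illustrative and not from the thread:

```perl
use strict;
use warnings;

# Hypothetical inputs: the same tail with and without a trailing newline.
my %input = (
    plain   => "text</body></html>",
    newline => "text</body></html>\n",
);

my %matched;
for my $name (sort keys %input) {
    my $s = $input{$name};
    $matched{$name} = {
        '$'  => ($s =~ m|</html>$|  ? 1 : 0),   # end, or just before a final "\n"
        '\Z' => ($s =~ m|</html>\Z| ? 1 : 0),   # same as $ when /m is not in effect
        '\z' => ($s =~ m|</html>\z| ? 1 : 0),   # only the very end of the string
    };
    printf "%-8s \$=%d \\Z=%d \\z=%d\n",
        $name, @{ $matched{$name} }{ '$', '\Z', '\z' };
}
```

If the document may end in a newline after </html>, $ or \Z still match, while \z does not.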

Re: Optimizing a regex that replaces near the end of a string
by thunders (Priest) on Mar 18, 2009 at 21:31 UTC
    If your end string is really something static like that, you may not want to use a regex at all. Here's a case insensitive replacement using rindex and substr:
    fbard@devo1:~/spot_camp_layout$ cat rindex
    use Benchmark qw(cmpthese);
    my $x = ("x" x 100_000)."</body></html>";
    my $lc_x = lc($x);
    my $end_html = qr{(</body></html>)};
    cmpthese(100_000,{
        rindex_substr => sub{
            my $str = $x;
            substr($str,rindex($lc_x,"</body>"),0,"yyyy");
        },
        regex => sub{
            my $str = $x;
            $str =~ s|$end_html$|yyyy$1|i;
        }
    });

                     Rate         regex rindex_substr
    regex         18416/s            --          -75%
    rindex_substr 73529/s          299%            --

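
One thing to watch with the rindex/substr approach: rindex returns -1 when the marker is absent, and substr with an offset of -1 would then insert just before the last character. A minimal sketch with a guard follows; insert_before_last is a hypothetical helper name, not something from the thread, and it assumes that lc does not change the string's length (true for ASCII markup).

```perl
use strict;
use warnings;

# insert_before_last() is a hypothetical helper, not from the thread.
# It inserts $insert before the last case-insensitive occurrence of $marker.
# Returns 1 on success and 0 if the marker is absent.
# Assumption: lc() keeps character offsets unchanged (true for ASCII text).
sub insert_before_last {
    my ($str_ref, $marker, $insert) = @_;
    my $pos = rindex(lc $$str_ref, lc $marker);
    return 0 if $pos < 0;                 # rindex returns -1 when nothing is found
    substr($$str_ref, $pos, 0, $insert);  # 4-arg substr: zero-length replace = insert
    return 1;
}

my $x  = ("x" x 100_000) . "</BODY></html>";
my $ok = insert_before_last(\$x, "</body>", "yyyy");

my $y       = "no closing tag here";
my $missing = insert_before_last(\$y, "</body>", "yyyy");

print "inserted: $ok, missing case returned: $missing\n";
```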
Re: Optimizing a regex that replaces near the end of a string
by thunders (Priest) on Mar 18, 2009 at 22:55 UTC
    Another technique I've used with some success is simply calling reverse() on the string, running the regex, and then reversing again. You'll want to benchmark against your typical input, of course.
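
A minimal sketch of the reverse approach, assuming a fixed tag and a fixed insertion string (the variable names are illustrative). Both the tag and the text to insert have to be reversed as well, and the two calls to reverse each copy the whole string, so this is not guaranteed to be faster:

```perl
use strict;
use warnings;

my $x      = ("x" x 100_000) . "</body></html>";
my $insert = "zzz";

my $rev = reverse $x;          # scalar context: reverses the characters
my $tag = reverse "</body>";   # the tag as it appears in the reversed string
my $ins = reverse $insert;

# The first match in the reversed string is the last one in the original.
# "Before the tag" in the original is "after the tag" when reversed.
$rev =~ s{(\Q$tag\E)}{$1$ins};

$x = reverse $rev;
print substr($x, -20), "\n";
```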
Re: Optimizing a regex that replaces near the end of a string
by moritz (Cardinal) on Mar 18, 2009 at 23:28 UTC
    This might not apply in your specific case, but perhaps a nice idea anyway:

    Suppose you know an exact substring that occurs before the match. Then you can find its position with rindex, set pos to that value, and add \G.* in front of your regex.
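
A minimal sketch of that idea, under stated assumptions: the landmark string '<div id="footer">' and the sample document are made up for illustration. The sketch uses a scalar-context m//g match (which starts at pos) followed by substr, rather than s///, and a non-greedy .*? so the capture lands on the first </body> after the landmark:

```perl
use strict;
use warnings;

# Hypothetical document: a known landmark sits shortly before </body>.
my $x = ("x" x 100_000) . '<div id="footer">bye</div></body></html>';

my $landmark = '<div id="footer">';      # assumed to occur before the match
my $from     = rindex($x, $landmark);
die "landmark not found\n" if $from < 0;

pos($x) = $from;                         # the next m//g match starts here
if ($x =~ m{\G.*?(</body>)}gs) {         # \G pins the match start to pos($x)
    substr($x, $-[1], 0, "zzz");         # $-[1] is the offset where group 1 starts
}

print substr($x, -30), "\n";
```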

Re: Optimizing a regex that replaces near the end of a string
by JavaFan (Canon) on Mar 19, 2009 at 10:25 UTC
    You should realize that 1) Perl regexes aren't very fast, and 2) the optimizer is really awesome. Perl regexes look fast because of the optimizer.

    If you have a string like that, and you need to replace something near </body>, I presume your pattern has </body in it. It might very well be that the optimizer spots this and does all the optimization you want already.

    Try running it with regex debugging enabled (if you have 5.10).
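
A minimal sketch of turning on regex debugging from inside a script, using the question's sample string. The re pragma's debug mode prints the compiled program and the optimizer's "Guessing start of match" lines to STDERR:

```perl
use strict;
use warnings;

my $x = ("x" x 100_000) . "</body></html>";

{
    use re 'debug';              # lexically scoped: affects patterns compiled in this block
    $x =~ s{(</body>)}{zzz$1};   # watch STDERR for the offset the optimizer finds
}

print substr($x, -20), "\n";
```

If the debug output reports the substring found near offset 100000, the engine is not scanning the 100,000 leading characters one match attempt at a time.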
