http://qs321.pair.com?node_id=1171492


in reply to Re: Get a known substring from a string
in thread Get a known substring from a string

That is equivalent to $found = 'thispart' if 1+index( $str, 'thispart' );, but index is 3 times faster:

s='the quick brown fox jumps over the lazy dog'; cmpthese -1,{ a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ], b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ], };; Rate a b a 585631/s -- -77% b 2535746/s 333% --

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Get a known substring from a string
by Corion (Patriarch) on Sep 10, 2016 at 09:42 UTC

    I know it looks really trivial once you see it, but I'm really astonished by your approach of using 1+index(...) - it had not occurred to me to use index that way in an expression to check for presence. I'll add that to my set of idiosyncratic phrases, just like if( system(...) == 0 ) { for successful execution of subprocesses.

    Update: I wondered about how much the capturing parentheses cost, and it seems they account for roughly a third half of the performance attainable when using the regex engine. Maybe the two additional steps executed in the regex engine (OPEN1 and CLOSE1) are to blame for that, as they effectively double the number of steps the regex engine has to execute for a successful match.

    Not invoking the regex engine still is much faster, even though I had thought there once was an optimization that turned constant regular expressions without anchors or quantifiers into an index lookup...

    # a: if( $s =~ m[(lazy)] ){ $found=$1 } Compiling REx "(lazy)" Final program: 1: OPEN1 (3) 3: EXACT <lazy> (5) 5: CLOSE1 (7) 7: END (0) anchored "lazy" at 0 (checking anchored) minlen 4 Matching REx "(lazy)" against "the quick brown fox jumps over the lazy + dog" Intuit: trying to determine minimum start position... Found anchored substr "lazy" at offset 35... (multiline anchor test skipped) try at offset... Intuit: Successfully guessed: match at offset 35 35 < the > <lazy dog> | 1:OPEN1(3) 35 < the > <lazy dog> | 3:EXACT <lazy>(5) 39 <the lazy> < dog> | 5:CLOSE1(7) 39 <the lazy> < dog> | 7:END(0) Match successful! Freeing REx: "(lazy)" # b: $found = 'lazy' if 1+index( $s, 'lazy' ); # c: if( $s =~ m[lazy] ){ $found=$& } Compiling REx "lazy" Final program: 1: EXACT <lazy> (3) 3: END (0) anchored "lazy" at 0 (checking anchored isall) minlen 4 Matching REx "lazy" against "the quick brown fox jumps over the lazy d +og" Intuit: trying to determine minimum start position... Found anchored substr "lazy" at offset 35... (multiline anchor test skipped) try at offset... Intuit: Successfully guessed: match at offset 35 Freeing REx: "lazy" Rate a c b a 2038631/s -- -50% -75% c 4089154/s 101% -- -49% b 8013601/s 293% 96% --

    The program I used:

    use strict; use Benchmark 'cmpthese'; use vars '$s'; $s='the quick brown fox jumps over the lazy dog'; my $found; my %benchmarks = ( a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ], b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ], c => q[ if( $s =~ m[lazy] ){ $found=$& } ], ); { use re 'debug'; for (sort keys %benchmarks) { print "# $_: $benchmarks{$_}\n"; undef $found; my $code = eval qq{sub { $benchmarks{$_} } } or die "Couldn't compile benchmark $_: $@"; $code->(); $found eq 'lazy' or die "Unexpected results: [$found] vs. 'lazy'"; }; }; cmpthese( -1, \%benchmarks);
      I'm really astonished by your approach of using 1+index(...) - it had not occurred to me to use index that way in an expression to check for presence.

      I've been using it that way for as long as I can remember. This turns up 50+ of my uses here with the earliest being in September 2002 which is only a few months after I started coming here.

      Adding the one has another benefit when doing searches in a loop:

      ##do something with $p - 1 while $p = 1 + index( $haystack, $needle, $ +p );

      That of automatically moving the start point along after each match.

      You have to remember to subtract 1 when using the match point; but that's no more onerous than remembering to increment it.

      I had thought there once was an optimization that turned constant regular expressions without anchors or quantifiers into an index lookup...

      That optimisation is there, and can be even quicker than index but you have to code it exactly correctly for it to kick in::

      $s = 'the quick brown fox jumps over the lazy dog'; cmpthese -1,{ a => q[ if( $s =~ m[(lazy)]){ $found=$1 } ], b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ], c => q[ $found = 'lazy' if $s =~ 'lazy'; ], };; Rate a b c a 577066/s -- -77% -79% b 2492720/s 332% -- -11% c 2791311/s 384% 12% -- [0]{} Perl>

      Unfortunately, it doesn't generalise. Even using a variable instead of literal means some of that performance gain is lost:

      $s = 'the quick brown fox jumps over the lazy dog'; $x = 'lazy'; cmpthese -1,{ a => q[ if( $s =~ m[($x)]){ $found = $1 } ], b => q[ $found = $x if 1 + index( $s, $x ); ], c => q[ $found = $x if $s =~ $x ], };; Rate a c b a 449697/s -- -79% -82% c 2167332/s 382% -- -12% b 2462877/s 448% 14% -- $s = 'the quick brown fox jumps over the lazy dog'; $x = 'lazy'; cmpthese -1,{ a => q[ if( $s =~ m[($x)]){ $found = $1 } ], b => q[ $found = $x if 1 + index( $s, $x ); ], c => q[ $found = $x if $s =~ $x ], };; Rate a c b a 459542/s -- -79% -80% c 2184810/s 375% -- -6% b 2318112/s 404% 6% --

      The nature of the type of work I do means that I learnt early on to only start the regex engine if I needed regex.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Don't get me wrong - TIMTOWTDI. But how is that expression nearly as clear as index($str,  $whatever) > -1? For the system, it is probably equivalent regarding the overall number of operations. For the human brain (mine at least), it is one less.

        Select wisely the idioms you use. Others reading your code will thank you. I, for one, would not use that one because I find it is just for the sake of brevity without its being worth one more thought every time it is read.

        while ( ($p = index $haystack, $needle, $p+1) > -1 ) { # e.g. (not tested) pos($haystack) = $p; $p = pos($haystack) if $str =~ s{ \G ... }{...}cxms # regex doing the tough dishes ; # /c flag: preserve pos() on non-match }