Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: substrings that consist of repeating characters

by Tux (Canon)
on Sep 29, 2020 at 11:40 UTC ( [id://11122323]=note: print w/replies, xml ) Need Help??


in reply to substrings that consist of repeating characters

I was surprised to see the $& being the fastest. At least on my perl-5.28:

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATT +GGGGACTTT"; my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 ); use Test::More; use Benchmark qw(cmpthese); my %subs; sub v1 { %subs = (); $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/( +[ACGT])\K(?!\1)/,/gr); } # v1 sub v2 { %subs = (); $subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ + s/(\w)\K(?!\1)/,/gr); } # v2 sub v3 { %subs = (); $subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g; } # v3 sub v4 { %subs = (); $subs{$1}++ while $string =~ m{(([ACGT])\2+)}g; } # v4 sub v5 { %subs = (); $subs{$&}++ while $string =~ m{([ACGT])\1+}g; } # v5 v1 (); is_deeply (\%subs, \%expect, "v1"); v2 (); is_deeply (\%subs, \%expect, "v2"); v3 (); is_deeply (\%subs, \%expect, "v3"); v4 (); is_deeply (\%subs, \%expect, "v4"); v5 (); is_deeply (\%subs, \%expect, "v5"); printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a-> +[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs; cmpthese (-2, { v1 => \&v1, v2 => \&v2, v3 => \&v3, v4 => \&v4, v5 => +\&v5 }); done_testing;

=>

ok 1 - v1 ok 2 - v2 ok 3 - v3 ok 4 - v4 ok 5 - v5 1 6 CCCCCC 1 4 GGGG 1 3 AAA 3 3 TTT 2 2 AA 2 2 GG 5 2 TT Rate v2 v1 v3 v4 v5 v2 41981/s -- -30% -52% -56% -57% v1 59864/s 43% -- -31% -38% -39% v3 87244/s 108% 46% -- -9% -12% v4 95919/s 128% 60% 10% -- -3% v5 98685/s 135% 65% 13% 3% -- 1..5

Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^2: substrings that consist of repeating characters
by salva (Canon) on Sep 29, 2020 at 12:41 UTC
    IIRC, once perl sees $& anywhere in the program code, it starts to populate that variable (and $' and $`) for all the regular expression matches in the program. Using it impacts the performance of all the regular expressions in the code, not just those ones where it is actually needed!

      perlvar does mention the issue, but it also says this has been fully fixed since v5.20.

      Edit: so this would mean that you might still get the same relative positions for the different versions on older version of perls, because although $& would be significantly worse than the other solutions on their own, it would actually lower the performances of all other versions when used in the benchmark.

Re^2: substrings that consist of repeating characters
by Eily (Monsignor) on Sep 29, 2020 at 13:20 UTC

    Edit: I thought I had a better version but no. I ran the same benchmark again and the results were not the same at all (actually the three solutions had very similar performances). Something went wrong with my first attempt

    I'm actually consistantly getting result that are worse without backreferences which which I don't understand...

      my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATT +GGGGACTTT"; my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 ); my $n = shift // 1; if ($n > 1) { $string = $string x $n; $_ *= $n for values %expect; } use Test::More; use Benchmark qw(cmpthese); my %subs; my @v = map { "v$_" } 1 .. 8; my %f; @f{@v} = ( sub { %subs = (); $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/( +[ACGT])\K(?!\1)/,/gr); }, # v1 sub { %subs = (); $subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ + s/(\w)\K(?!\1)/,/gr); }, # v2 sub { %subs = (); $subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g; }, # v3 sub { %subs = (); $subs{$1}++ while $string =~ m{(([ACGT])\2+)}g; }, # v4 sub { %subs = (); $subs{$&}++ while $string =~ m{([ACGT])\1+}g; }, # v5 sub { %subs = (); $subs{$&}++ while $string =~ m{A{2,}|C{2,}|G{2,}|T{2,}}g; }, # v6 sub { %subs = (); $subs{$&}++ while $string =~ m{AA+|CC+|GG+|TT+}g; }, # v7 sub { %subs = (); $subs{$&}++ while $string =~ m{()AA+|CC+|GG+|TT+}g; }, # v8 ); for (@v) { $f{$_}->(); is_deeply (\%subs, \%expect, $_); } printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a-> +[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs; cmpthese (-2, { map {( $_ => $f{$_} )} @v }); done_testing;
      $ test.pl 1 ok 1 - v1 ok 2 - v2 ok 3 - v3 ok 4 - v4 ok 5 - v5 ok 6 - v6 ok 7 - v7 ok 8 - v8 1 6 CCCCCC 1 4 GGGG 1 3 AAA 3 3 TTT 2 2 AA 2 2 GG 5 2 TT Rate v2 v1 v7 v3 v4 v5 v6 v8 v2 41819/s -- -30% -45% -53% -57% -58% -60% -63% v1 60150/s 44% -- -21% -32% -38% -40% -43% -47% v7 76560/s 83% 27% -- -13% -22% -23% -28% -32% v3 88071/s 111% 46% 15% -- -10% -12% -17% -22% v4 97745/s 134% 63% 28% 11% -- -2% -8% -13% v5 99555/s 138% 66% 30% 13% 2% -- -6% -12% v6 105700/s 153% 76% 38% 20% 8% 6% -- -6% v8 112783/s 170% 88% 47% 28% 15% 13% 7% -- 1..8
      $ test.pl 20 ok 1 - v1 ok 2 - v2 ok 3 - v3 ok 4 - v4 ok 5 - v5 ok 6 - v6 ok 7 - v7 ok 8 - v8 20 6 CCCCCC 20 4 GGGG 20 3 AAA 60 3 TTT 40 2 AA 40 2 GG 100 2 TT Rate v2 v1 v7 v3 v4 v5 v6 v8 v2 2327/s -- -29% -47% -52% -55% -57% -61% -65% v1 3284/s 41% -- -26% -32% -37% -39% -45% -50% v7 4419/s 90% 35% -- -9% -15% -17% -26% -33% v3 4853/s 109% 48% 10% -- -7% -9% -18% -27% v4 5215/s 124% 59% 18% 7% -- -3% -12% -21% v5 5351/s 130% 63% 21% 10% 3% -- -10% -19% v6 5934/s 155% 81% 34% 22% 14% 11% -- -10% v8 6604/s 184% 101% 49% 36% 27% 23% 11% -- 1..8
      $ test.pl 2000 ok 1 - v1 ok 2 - v2 ok 3 - v3 ok 4 - v4 ok 5 - v5 ok 6 - v6 ok 7 - v7 ok 8 - v8 2000 6 CCCCCC 2000 4 GGGG 2000 3 AAA 6000 3 TTT 4000 2 AA 4000 2 GG 10000 2 TT Rate v2 v1 v7 v3 v4 v5 v6 v8 v2 21.3/s -- -35% -50% -54% -60% -61% -64% -68% v1 32.7/s 54% -- -23% -30% -38% -39% -45% -51% v7 42.6/s 100% 30% -- -9% -19% -21% -28% -36% v3 46.6/s 119% 42% 9% -- -12% -14% -21% -30% v4 52.7/s 147% 61% 24% 13% -- -2% -11% -21% v5 54.0/s 154% 65% 27% 16% 3% -- -9% -19% v6 59.2/s 178% 81% 39% 27% 13% 10% -- -11% v8 66.3/s 212% 103% 56% 42% 26% 23% 12% -- 1..8

      Enjoy, Have FUN! H.Merijn
      you might want to run the different variants thru use re "debug" to see how they are translated into regex primitives. This might give you a clue what is happening.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      Note also that the benchmark results may be different for other input strings. The one in the OP is short and all the same-char substrings are also short, so for instance, results may be different if you use a long string containing long same-char substrings.

        Yes indeed. Benchmarking correctly is hard :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11122323]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (6)
As of 2024-04-24 06:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found