Re: substrings that consist of repeating characters

I was surprised to see `$&` come out fastest, at least on my perl-5.28:

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 );

use Test::More;
use Benchmark qw(cmpthese);

my %subs;

sub v1 {
    %subs = ();
    $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
    } # v1

sub v2 {
    %subs = ();
    $subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ s/(\w)\K(?!\1)/,/gr);
    } # v2

sub v3 {
    %subs = ();
    $subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g;
    } # v3

sub v4 {
    %subs = ();
    $subs{$1}++ while $string =~ m{(([ACGT])\2+)}g;
    } # v4

sub v5 {
    %subs = ();
    $subs{$&}++ while $string =~ m{([ACGT])\1+}g;
    } # v5

v1 (); is_deeply (\%subs, \%expect, "v1");
v2 (); is_deeply (\%subs, \%expect, "v2");
v3 (); is_deeply (\%subs, \%expect, "v3");
v4 (); is_deeply (\%subs, \%expect, "v4");
v5 (); is_deeply (\%subs, \%expect, "v5");

printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs;

cmpthese (-2, { v1 => \&v1, v2 => \&v2, v3 => \&v3, v4 => \&v4, v5 => \&v5 });

done_testing;

ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
    1   6 CCCCCC
    1   4 GGGG
    1   3 AAA
    3   3 TTT
    2   2 AA
    2   2 GG
    5   2 TT
      Rate   v2   v1   v3   v4   v5
v2 41981/s   -- -30% -52% -56% -57%
v1 59864/s  43%   -- -31% -38% -39%
v3 87244/s 108%  46%   --  -9% -12%
v4 95919/s 128%  60%  10%   --  -3%
v5 98685/s 135%  65%  13%   3%   --
1..5

Enjoy, Have FUN! H.Merijn


Re^2: substrings that consist of repeating characters by salva (Canon) on Sep 29, 2020 at 12:41 UTC
IIRC, once perl sees `$&` anywhere in the program code, it starts to populate that variable (and `$'` and `` $` ``) for all the regular expression matches in the program. Using it impacts the performance of all the regular expressions in the code, not just the ones where it is actually needed!
Re^3: substrings that consist of repeating characters by Eily (Monsignor) on Sep 29, 2020 at 13:16 UTC
perlvar does mention the issue, but it also says this has been fully fixed since v5.20. Edit: so on older perls you might still get the same relative ranking of the different versions: although `$&` would be significantly slower than the other solutions on its own, its mere presence in the benchmark script would also slow down all the other versions.
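For perls older than 5.20, where the penalty discussed above still applies, a variant of v5 can avoid mentioning `$&` at all. What follows is only a sketch under that assumption: `count_with_p` and `count_with_offsets` are made-up names, not part of the benchmark in this thread. The first relies on the `/p` modifier and `${^MATCH}` (available since perl 5.10), the second slices the match out of the string using the offsets in `@-` and `@+`.

```perl
use strict;
use warnings;

# Hypothetical helper: like v5, but reads the match from ${^MATCH},
# which the /p modifier makes available without touching $&.
sub count_with_p {
    my ($string) = @_;
    my %subs;
    $subs{${^MATCH}}++ while $string =~ m{([ACGT])\1+}gp;
    return \%subs;
    } # count_with_p

# Hypothetical helper: same idea, using the start/end offsets of the
# whole match ($-[0] and $+[0]) instead of a match variable.
sub count_with_offsets {
    my ($string) = @_;
    my %subs;
    $subs{ substr $string, $-[0], $+[0] - $-[0] }++
        while $string =~ m{([ACGT])\1+}g;
    return \%subs;
    } # count_with_offsets

my $counts = count_with_p ("AAATTTAGTT");
printf "%-4s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Whether either of these is actually faster than `$1` or `$&` on a given perl is something only a benchmark on that perl can tell.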
Re^2: substrings that consist of repeating characters by Eily (Monsignor) on Sep 29, 2020 at 13:20 UTC
Edit: I thought I had a better version but no. I ran the same benchmark again and the results were not the same at all (actually the three solutions had very similar performance). Something went wrong with my first attempt. I'm actually consistently getting results that are worse without backreferences, which I don't understand...
Re^3: substrings that consist of repeating characters by Tux (Canon) on Sep 29, 2020 at 15:47 UTC
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 );

my $n = shift // 1;
if ($n > 1) {
    $string = $string x $n;
    $_ *= $n for values %expect;
    }

use Test::More;
use Benchmark qw(cmpthese);

my %subs;
my @v = map { "v$_" } 1 .. 8;
my %f;
@f{@v} = (
    sub {
        %subs = ();
        $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
        }, # v1
    sub {
        %subs = ();
        $subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ s/(\w)\K(?!\1)/,/gr);
        }, # v2
    sub {
        %subs = ();
        $subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g;
        }, # v3
    sub {
        %subs = ();
        $subs{$1}++ while $string =~ m{(([ACGT])\2+)}g;
        }, # v4
    sub {
        %subs = ();
        $subs{$&}++ while $string =~ m{([ACGT])\1+}g;
        }, # v5
    sub {
        %subs = ();
        $subs{$&}++ while $string =~ m{A{2,}|C{2,}|G{2,}|T{2,}}g;
        }, # v6
    sub {
        %subs = ();
        $subs{$&}++ while $string =~ m{AA+|CC+|GG+|TT+}g;
        }, # v7
    sub {
        %subs = ();
        $subs{$&}++ while $string =~ m{()AA+|CC+|GG+|TT+}g;
        }, # v8
    );

for (@v) {
    $f{$_}->();
    is_deeply (\%subs, \%expect, $_);
    }

printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs;

cmpthese (-2, { map {( $_ => $f{$_} )} @v });

done_testing;

$ test.pl 1
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
ok 6 - v6
ok 7 - v7
ok 8 - v8
    1   6 CCCCCC
    1   4 GGGG
    1   3 AAA
    3   3 TTT
    2   2 AA
    2   2 GG
    5   2 TT
       Rate   v2   v1   v7   v3   v4   v5   v6   v8
v2  41819/s   -- -30% -45% -53% -57% -58% -60% -63%
v1  60150/s  44%   -- -21% -32% -38% -40% -43% -47%
v7  76560/s  83%  27%   -- -13% -22% -23% -28% -32%
v3  88071/s 111%  46%  15%   -- -10% -12% -17% -22%
v4  97745/s 134%  63%  28%  11%   --  -2%  -8% -13%
v5  99555/s 138%  66%  30%  13%   2%   --  -6% -12%
v6 105700/s 153%  76%  38%  20%   8%   6%   --  -6%
v8 112783/s 170%  88%  47%  28%  15%  13%   7%   --
1..8

$ test.pl 20
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
ok 6 - v6
ok 7 - v7
ok 8 - v8
   20   6 CCCCCC
   20   4 GGGG
   20   3 AAA
   60   3 TTT
   40   2 AA
   40   2 GG
  100   2 TT
     Rate   v2   v1   v7   v3   v4   v5   v6   v8
v2 2327/s   -- -29% -47% -52% -55% -57% -61% -65%
v1 3284/s  41%   -- -26% -32% -37% -39% -45% -50%
v7 4419/s  90%  35%   --  -9% -15% -17% -26% -33%
v3 4853/s 109%  48%  10%   --  -7%  -9% -18% -27%
v4 5215/s 124%  59%  18%   7%   --  -3% -12% -21%
v5 5351/s 130%  63%  21%  10%   3%   -- -10% -19%
v6 5934/s 155%  81%  34%  22%  14%  11%   -- -10%
v8 6604/s 184% 101%  49%  36%  27%  23%  11%   --
1..8

$ test.pl 2000
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
ok 6 - v6
ok 7 - v7
ok 8 - v8
 2000   6 CCCCCC
 2000   4 GGGG
 2000   3 AAA
 6000   3 TTT
 4000   2 AA
 4000   2 GG
10000   2 TT
     Rate   v2   v1   v7   v3   v4   v5   v6   v8
v2 21.3/s   -- -35% -50% -54% -60% -61% -64% -68%
v1 32.7/s  54%   -- -23% -30% -38% -39% -45% -51%
v7 42.6/s 100%  30%   --  -9% -19% -21% -28% -36%
v3 46.6/s 119%  42%   9%   -- -12% -14% -21% -30%
v4 52.7/s 147%  61%  24%  13%   --  -2% -11% -21%
v5 54.0/s 154%  65%  27%  16%   3%   --  -9% -19%
v6 59.2/s 178%  81%  39%  27%  13%  10%   -- -11%
v8 66.3/s 212% 103%  56%  42%  26%  23%  12%   --
1..8

Enjoy, Have FUN! H.Merijn
Re^3: substrings that consist of repeating characters by LanX (Saint) on Sep 29, 2020 at 15:45 UTC
You might want to run the different variants through `use re "debug"` to see how they are translated into regex primitives. This might give you a clue about what is happening.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
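A minimal sketch of that suggestion, assuming you only want to compare the backreference pattern with the plain alternation: `use re 'debug'` is lexically scoped, so wrapping just the `qr{}` compilations in a block keeps the dump (which goes to STDERR) limited to those two patterns. The variable names here are illustrative only.

```perl
use strict;
use warnings;

my $string = "AAATTTAGTT";

my ($backref, $alternation);
{
    # Lexically scoped: only patterns compiled inside this block get
    # their compiled program dumped to STDERR.
    use re 'debug';
    $backref     = qr{([ACGT])\1+};
    $alternation = qr{AA+|CC+|GG+|TT+};
    }

# Count the matches of each pattern on the same input; the two counts
# should agree, since both patterns describe runs of two or more
# identical characters.
my $n_backref     = () = $string =~ m/$backref/g;
my $n_alternation = () = $string =~ m/$alternation/g;

print "backref: $n_backref, alternation: $n_alternation\n";
```

Comparing the two dumps side by side shows which primitives and optimizations the regex compiler chose for each variant; how that maps to the timings above still has to be worked out from the output.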
Re^3: substrings that consist of repeating characters by salva (Canon) on Sep 29, 2020 at 14:56 UTC
Note also that the benchmark results may be different for other input strings. The one in the OP is short and all its same-char substrings are also short; results may well differ for, say, a long string containing long same-char substrings.
Re^4: substrings that consist of repeating characters by Eily (Monsignor) on Sep 29, 2020 at 15:13 UTC
Yes indeed. Benchmarking correctly is hard :)
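One way to try the "long string with long runs" case is to generate the input from a list of run lengths and compute the expected counts without any regex, so the check does not depend on the code under test. This is just a sketch: `build_runs` and `count_runs` are hypothetical helpers, and the run lengths below are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: build a string from (char => run length) pairs.
# Neighbouring pairs should use different chars, or their runs merge.
sub build_runs {
    my @pairs = @_;
    my $s = "";
    while (my ($c, $n) = splice @pairs, 0, 2) {
        $s .= $c x $n;
        }
    return $s;
    } # build_runs

# Hypothetical helper: regex-free reference count of all runs of two
# or more identical characters.
sub count_runs {
    my ($s) = @_;
    my %count;
    my $len = length $s;
    my $i   = 0;
    while ($i < $len) {
        my $c = substr $s, $i, 1;
        my $j = $i + 1;
        $j++ while $j < $len && substr ($s, $j, 1) eq $c;
        $count{ substr $s, $i, $j - $i }++ if $j - $i >= 2;
        $i = $j;
        }
    return \%count;
    } # count_runs

my $string = build_runs (A => 1000, C => 1, G => 500, T => 2, A => 1000);
my $expect = count_runs ($string);
printf "%d distinct runs in a string of length %d\n",
    scalar (keys %$expect), length $string;
```

The resulting `$string` and `%$expect` could then stand in for the fixed `$string` and `%expect` in the benchmark scripts above.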

