Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re^3: how to count the number of repeats in a string (really!)

by Krambambuli (Curate)
on Nov 14, 2007 at 17:56 UTC ( [id://650810]=note: print w/replies, xml ) Need Help??


in reply to Re^2: how to count the number of repeats in a string (really!)
in thread how to count the number of repeats in a string (really!)

I've tried to smoothen the differences between your non-recursive but using regexp solution and mine, which is recursive but doesn't use any RX.

After benchmarking, you're the clear winner:
Benchmark: timing 2000 iterations of Krambambuli, blazar... Krambambuli: 4 wallclock secs ( 3.28 usr + 0.01 sys = 3.29 CPU) @ 6 +07.90/s (n=2000) blazar: 1 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) @ 25 +00.00/s (n=2000)
Congrats! :)

Here's the code I've used for the benchmarking.
use strict; use warnings; use Benchmark; use Data::Dumper; use constant MIN => 2; my $str = 'aabcdabcabcecdecd'; my $min_length = 2; my $min_count = 2; timethese(2000, { 'blazar' => sub { count1( $str ) }, 'Krambambuli' => sub { count2( $str ) }, }); { my %count; sub count1 { my( $string) = @_; my $length = length( $string ); if ($length < MIN) { $count{$_} == 1 and delete $count{$_} for keys %count; return \%count; } for my $l (MIN..$length) { my $s = substr( $string, 0, $l ); $count{ $s } += 1; } count1( substr( $string, 1 ) ); } } sub count2 { local $_=shift; my $l=length; my %count; for my $off (0..$l-1) { for my $len (MIN .. $l-$off) { my $s=substr $_, $off, $len; $count{ $s } ||= ()= /$s/g; } } $count{$_} == 1 and delete $count{$_} for keys %count; \%count; }

Update: Actually, it's in fact the other way round, my code is faster - I've just named the benchmarked subs wrongly. Duh! Sorry.

Krambambuli
---
enjoying Mark Jason Dominus' Higher-Order Perl

Replies are listed 'Best First'.
Re^4: how to count the number of repeats in a string (really!)
by blazar (Canon) on Nov 14, 2007 at 22:08 UTC
    After benchmarking, you're the clear winner:

    I personally believe that I'm not! ;)

    Anyway, I extended the benchmark to include oha's solution (in my version) and ikegami's. For definiteness I made sure to compare subs that do exactly the same thing: accept a string and return a hashref of the counts, and that the counts are those of strings of length 2 or more repeated 2 or more times. I could not include lodin's solution because I haven't the slightest idea of how to modify it so that only strings of length 2 or more are taken into account. (Short of deleting thos of length 1.) Admittedly, it would be interesting to see how the benchmark goes with different data sets...

    Results:

    Rate blazar kramba ikegami oha blazar 405/s -- -77% -89% -89% kramba 1788/s 341% -- -52% -52% ikegami 3743/s 823% 109% -- -0% oha 3743/s 823% 109% 0% --

    Update: I included lodin's solution as per his direction as follows:

    I also corrected the kramba subroutine as per lodin's remark:

    The new results are:

    Rate blazar kramba oha ikegami lodin blazar 577/s -- -72% -90% -91% -91% kramba 2045/s 254% -- -66% -67% -69% oha 6039/s 946% 195% -- -2% -9% ikegami 6154/s 966% 201% 2% -- -8% lodin 6667/s 1055% 226% 10% 8% --

    Update: lodin draws my attention by /msg on the fact that the closure leaks and he directs me to his own Sub::Recursive both for an explanation and a fix. He also notices that he didn't include the manual fix in the docs. Anyway, one can use local our $count; in this particular instance. I'm doing a new benchmark which will be posted in a separate node taking also this into account, although I don't think it will make that much of a difference.

      To include mine, just change .+ to .{N,} where N is the minimum length.

      lodin

      The kramba subroutine suffers from an accumulating %count. It can be fixed by doing %count = () and returning { %count } (a copy of %count); making %count global and localizing it; or my favourite: write the recursive subroutine as an anonymous closure in kramba.

      lodin

      Admittedly, it would be interesting to see how the benchmark goes with different data sets...
      I used the following text, not big but should be enough
      my $s = 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.'; $s = s/[\r\n]//g;
      After reading the code of ikegami and lodin, "I personally believe"d they were faster but i was wrong:
      s/iter ikegami lodin oha ikegami 1.43 -- -1% -48% lodin 1.42 1% -- -48% oha 0.740 93% 92% --
      I suspect the regex engine is so much smarter in finding fixed char repetition to gain more then the gain of doing lots regex call...

      Oha

      PS: i didn't passed the string to the subs and didn't returned the results.

      Update: removed the \n from the string, updated results

      Update: my code is broken: it must reset pos to the previous pos+1 (if not some subpatterns aren't matched). updated results are as "i personally believe"d:

      s/iter oha lodin ikegami oha 1.66 -- -12% -14% lodin 1.46 14% -- -2% ikegami 1.43 16% 2% --
      I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again.

      The results seem to support my original suspicion that - at least for this particular problem - a regexp based solution would have to loose the fight against an approach that never has to look behind or ahead, but just touches every possible substring exactly 1 time:
      Results for string: "aabcdabcabcecdecd " Rate blazar kramba ikegami lodin oha blazar 464/s -- -77% -89% -90% -90% kramba 2062/s 344% -- -52% -54% -58% ikegami 4255/s 817% 106% -- -4% -13% lodin 4444/s 858% 116% 4% -- -9% oha 4878/s 951% 137% 15% 10% -- Results for string: "aabcdabcabcecdecd aabcdabcabcecdecd " Rate blazar oha kramba ikegami lodin blazar 92.4/s -- -85% -86% -86% -87% oha 610/s 560% -- -7% -9% -14% kramba 658/s 612% 8% -- -1% -7% ikegami 667/s 621% 9% 1% -- -6% lodin 709/s 667% 16% 8% 6% -- Results for string: "aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecde +cd " Rate blazar lodin ikegami oha kramba blazar 21.4/s -- -85% -86% -86% -90% lodin 144/s 574% -- -3% -5% -35% ikegami 148/s 594% 3% -- -2% -33% oha 151/s 607% 5% 2% -- -31% kramba 220/s 930% 53% 48% 46% --
      Here's the full code I've used for benchmarking. Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions.
      Speaking for me, I wouldn't know how to make it in the code above, even if I think of me as not being a novice any more when dealing with regular expressions.

      Update Ahmm... there are some things broken, and I'll have to find out which one. Checking results for simple cases looked ok, so I thought things are ok. But then trying to run the benchmark for the longer text that Oha proposed, I noticed some problems and so tried to just output the _count_ of strings retained by each sub for a string like

      'aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecdecd '

      Much to my surprize, that came out as

      467,337,791,467,467

      for respectively blazar, oha, kramba, ikegami, lodin. Ooops...

      How was that: who has a clock, knows what the time is, who has 2 clocks, has a problem... :)

      Update 2 With Oha's longer latin text, the counts are - in the same order as above - 419,244,371,371,371 and my little recursive beauty complains about 'Deep recursion on subroutine "main::kramba" at ./test.pl line 95'. Well, understandable...


      Krambambuli
      ---
      enjoying Mark Jason Dominus' Higher-Order Perl

        Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions.

        I've updated my original reply to include minimum substring length and minimum matched patterns. It would be interesting to see our and the other subroutines benchmarked against each other with this addition.

        lodin

        Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions.

        One simple solution is to go back to my (unposted) original solution and check MIN_REPEATS outside.

        use constant MIN_REPEATS => 2; # Must have at least this many repeats my $str = 'aabcdabcabcecdecd'; local our %counts; $str =~ /(.{2,})(?{ ++$counts{$1} })(?!)/s; delete @counts{ grep $counts{$_}<MIN_REPEATS, keys %counts }; use Data::Dumper; print Dumper \%counts;

        That way, there's nothing complicated left in the regex. The only part that's modifiable is /.{2,}/, which is easier to understand and modify than hand-rolled parsing code.

        a regexp based solution would have to loose the fight against an approach that never has to look behind or ahead, but just touches every possible substring exactly 1 time

        Not so! The above regex never has to look beind or ahead and touches every possible substring exactly 1 time.

        I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again.

        I personally believe that those steps were not all that unnecessary. To be definite, I chose to compare subs that accept a string and return a hashref of the counts. Yours doesn't, so some extra step is required. According to the last update to Re^4: how to count the number of repeats in a string (really!), I'm posting a new benchmark here, with your sub as a thin layer around the recursive sub. I hope that is fine...

        So, instead of the benchmark, for the moment I'm posting the script with the tests:

        15 tests out 18 fail.

        Update: I'm an idiot! All tests run smoothly once I use the correct reference. The lesson in this is: don't post when you're too tired. I'm going to sleep and I will update the node with the actual benchmark tomorrow morning... Sorry for the noise!

        Update: Ok, I woke up and I'm not that tired anymore. Here's the complete script:

        And here's the output:

        Krambambuli seems to win in the long string case. Anyway, analysis? Refinements, comments, additions?

        Update: well done, lodin++.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://650810]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (2)
As of 2024-04-26 03:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found