Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: how to count the number of repeats in a string (really!) [regexp solution]

by lodin (Hermit)
on Nov 14, 2007 at 18:59 UTC ( [id://650826]=note: print w/replies, xml ) Need Help??


in reply to how to count the number of repeats in a string (really!)

This is a perfect task for the regex engine.

local our %count; $str =~ / (.+) # or .{N,} where N is minimum length. (?(?{ $count{$1} }) (?!) ) .* \1 (?{ ($count{$1} ||= 1)++ }) (?!) /x;
A more generalized version where you can specify the minimum substring length and minimum number of occurances is
my $min_len = 2; # Substring is at least two chars long. my $min_count = 3; # Substring occures at least three times. local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .*? \1 ){@{[ $min_count - 2 ]}} .* \1 (?{ ($count{$1} ||= $min_count-1)++ }) (?!) /x;

lodin

Update:

While writing this, ikegami posted a very similar-looking reply. While they look very much alike they work quite differently. ikegami's work by requiring that each match is repeated further into the string, and then goes on to count all those successive matches kind of like a global match. Mine does the counting right away, and then forces the engine to not count those again. So they work rather opposite of each other.

I did a shallow benchmark. It seems that mine is a slight favourite (5-10%) in many, but not all, situations. I also get the impression that ikegami's scales slightly better, see Re^5: how to count the number of repeats in a string (really!).

Update 2:

Added the comment in the regex.

Update 3:

Added the generalized version.

Replies are listed 'Best First'.
Re^2: how to count the number of repeats in a string (really!) [regexp solution]
by ikegami (Patriarch) on Nov 16, 2007 at 20:41 UTC
    You made the same mistake I did. It fails to find two 'aba' in 'ababa'.

      Mistake and mistake. That's not how I interpreted the question. The problem statement is a bit vague on this. For practical purposes I don't think 'aaa' should report 2 x 'aa' as well.

      If this is the task however, it's better to just exhaustively match everything of a minimum length. The problem as I (and you?) interpreted it is actually more interesting, I think.

      lodin

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://650826]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2024-04-26 08:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found