In response to your postscript, a regex with the
/g modifier in list context returns a list of the matches (or a list of the captures for each match). The
() in
() = $line =~ /.../g enforces list context, even though you're assigning to an empty list. The other half of the trick is that list assignment in scalar context returns the number of elements being assigned (not being assigned TO), so
my $count = () = $line =~ /.../g stores the number of elements returned by
$line =~ /.../g.
Here's my "regex for the sake of regex" solution:
m{
(?{ [ {}, 0 ] })
^
\d*?
(\d+)
(?(?{
length($1) > (length($_) - $-[1])/2
or $^R->[0]{$1}++
})(?!))
(?{ [ $^R->[0], 1 ] })
(?>
(?: \d*? \1 (?{ [ $^R->[0], $^R->[1] + 1 ] }) )+
(?{ print report($1, $^R->[1]) })
)
(?!)
}x;
sub report {
my ($str, $count) = @_;
sprintf "length: %d digits: %s quantity: %d\n",
length($str), $str, $count;
}
Dissection will come later. For now, just breathe it in.
Update: here is the promised dissection. The meat of this program is simply the regex; the report() function just prints out the diagnostic information. First, here's the regex with comments.
m{
# set up $^R to hold a hashref and a number.
# anchor the regex to the beginning of the
# string this will ensure that we stop
# processing once we've exhausted the string.
(?{ [ {}, 0 ] })
^
# we select a run of digits (greedy) after
# some prefix of digits (non-greedy). the
# run is stored in $1
\d*? (\d+)
# we make sure $1 is short enough that
# it COULD be repeated (a 10 character $1
# in a 15 character string is no good).
# we also check to see if we've seen this
# particular run of digits before. this is
# analogous to grep { $seen{$_}++ } LIST.
# $^R->[0], the hashref, is our "seen" hash.
# if $1 has been seen, we fail via (?!),
# which causes us to backtrack our (\d+).
(?(?{
length($1) > (length($_) - $-[1])/2
or $^R->[0]{$1}++
})(?!))
# here we set the "count" part of $^R to 1.
# if you're wondering why I don't just do
# $^R->[1]++ (like I did for $^R->[0]{$1})
# I'll explain later. suffice to say, this
# is the proper and safe way to do it.
(?{ [ $^R->[0], 1 ] })
# we use (?>...) to prevent backtracking
# inside this part. this part tracks the
# number of times $1 is repeated in the
# rest of the string. we only want the
# most number of matches, so we don't want
# to backtrack inside here.
(?>
# we look for the non-$1 part followed
# by $1, one or more times (since we
# are only interested in $1's that ARE
# repeated). for each match of $1 we
# find, we add 1 to $^R->[1], the safe
# way.
(?:
\d*? \1
(?{ [ $^R->[0], $^R->[1] + 1 ] })
)+
# when we can't find anymore $1's, we
# report on it.
(?{ print report($1, $^R->[1]) })
)
# we force the regex to fail here, which
# causes the entire thing to backtrack,
# which ensures we try each substring of
# the entire string.
(?!)
}x;
The reason doing
$^R->[1]++ is wrong is because $^R is a magical variable that automatically "rolls back" (like a transaction table) when assertions are backtracked past. However, this only works
by keeping a stack of the return values from
(?{ ... }) assertions; in case you didn't know, the last value evaluated in
(?{ ... }) is pushed onto the "$^R stack", as the current value to use when $^R is used. By ASSIGNING TO $^R DIRECTLY, you circumvent this magic! Why, then, was it ok to do
$^R->[0]{$1}++? Simply because the "seen" hash is not something I want rolled back. It has to be persistent through the entirety of the regex.
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.