Sewi has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
a really simple RegEx problem is driving me crazy trying to find the most-often recurring char combination:
perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $2 if $x
+ =~ /((\w+).*?\2.*?)+/;'
abcdefgxxabcdefgzzabcdsjfhkdfab
ab
Simple, but I need to match only strings with a minimum of 2 chars:
perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $2 if $x
+ =~ /((\w{2,}).*?\2.*?)+/;'
abcdefgxxabcdefgzzabcdsjfhkdfab
abcdefg
Shows abcdefg but ab would be right (there is another abc behind the zz and another ab at the end).
perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $2 if $x
+ =~ /((\w{2,}?).*?\2.*?)+/;'
abcdefgxxabcdefgzzabcdsjfhkdfab
cd
doesn't work either, as ab would be right.
Why doesn't {2,} match ab?
Thanks,
Sewi
Re: RegEx + vs. {1,}
by grizzley (Chaplain) on Oct 10, 2012 at 12:01 UTC
|
Because you are trying to match string which occurs twice and 'abcdefg' is first correct candidate. I can think only about such approach for your problem:
$ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if
+$x =~ /(\w{2,})(.*?\1){2}/;'
abcdefgxxabcdefgzzabcdsjfhkdfab
abcd
$ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if
+$x =~ /(\w{2,})(.*?\1){3}/;'
abcdefgxxabcdefgzzabcdsjfhkdfab
ab
$ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if
+$x =~ /(\w{2,})(.*?\1){4}/;'
abcdefgxxabcdefgzzabcdsjfhkdfab
I.e. The problem is you have to say exactly how many occurences you want (there is no 'greediness' in this case == you can't say "I want as many occurences as possible", only lowest possible number of occurences will be chosen). | [reply] [d/l] |
|
Thanks, you can't say "I want as many occurences as possible" seems to be my problem.
| [reply] |
|
So if that's acceptable for you - use while loop to determine max amount of occurences. There will be no more than length / 2 occurences, so start with this max value and decrease it while trying to match:
$x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; $len=int(length($x)/2);
while($x !~ /(\w{2,})(.*?\1){$len}/)
{ $len-- };
$x =~ /(\w{2,})(.*?\1){$len}/; # 'strange line'
print $1
(to self: do not know why I have to add 'strange line', without it nothing is printed, but $len is correctly set to 4)
I tried to generate the list and include it in one regexp:
$ perl -le '$x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; $len=int(length($x
+)/2); $restring = join"|", map {"(?:.*?\\1){$_}"} reverse(1..$len); p
+rint $restring; print $1 if $x =~ /(\w{2,})($restring)/;'
(?:.*?\1){15}|(?:.*?\1){14}|(?:.*?\1){13}|(?:.*?\1){12}|(?:.*?\1){11}|
+(?:.*?\1){10}|(?:.*?\1){9}|(?:.*?\1){8}|(?:.*?\1){7}|(?:.*?\1){6}|(?:
+.*?\1){5}|(?:.*?\1){4}|(?:.*?\1){3}|(?:.*?\1){2}|(?:.*?\1){1}
abcdefg
but it does not work as expected (probably some stupid mistake, maybe someone else can tell what's wrong with it). | [reply] [d/l] [select] |
Re: RegEx + vs. {1,}
by trizen (Hermit) on Oct 10, 2012 at 12:25 UTC
|
my $string = 'abcdefgxxabcdefgzzabcdsjfhkdfab';
my %seen;
1 while $string =~ /(\w{2,}?)(?{$seen{$1}++})(?!)/g;
print [sort {
$seen{$b} <=> $seen{$a}
|| length($b) <=> length($a)
} keys %seen]->[0];
Update: Thanks Athanasius for your appreciation && corrections. | [reply] [d/l] |
|
my %seen;
for my $len (2 .. length $string)
{
++$seen{$1}, pos($string) -= length($1) - 1 while $string =~ /(\w{
+$len})/g;
}
but this is clumsy and verbose beside your elegant one-liner!
Took me a while to figure out what (?!) was doing. YAPE::Regex::Explain told me:
----------------------------------------------------------------------
(?{$seen{$1}++}) run this block of Perl code
----------------------------------------------------------------------
(?!) fail
----------------------------------------------------------------------
and in Extended Patterns I found:
(*FAIL) (*F)
This pattern matches nothing and always fails. It can be used to force
+ the engine to backtrack. It is equivalent to (?!), but easier to rea
+d. In fact, (?!) gets optimised into (*FAIL) internally.
So I’ve learned something that’s really useful.
Now a couple of minor quibbles:
- The non-greedy quantifier in (\w{2,}?) is redundant, since the forced backtracking ensures that all substrings will be found whether the search is greedy or not.
- If, as per your sort, you are looking for only the shortest substring with the maximum frequency, then you need only search for substrings exactly 2 characters long (the minimum length): (\w{2}). (Proof: Say XYZ occurs N times. XY and YZ occur once in each occurrence of XYZ, so each is guaranteed to occur at least N times in the string. Similar reasoning applies to longer substrings.)
++trizen for demonstrating this elegant technique to force regex backtracking!
Update: Because (?!) forces backtracking via repeatedly-failing matches, neither the while loop nor the /g modifier are necessary:
$string =~ /(\w{2,})(?{$seen{$1}++})(?!)/;
does the identical job! (as is easily confirmed by dumping the contents of %seen).
Athanasius <°(((>< contra mundum
| [reply] [d/l] [select] |
Re: RegEx + vs. {1,}
by MidLifeXis (Monsignor) on Oct 10, 2012 at 12:32 UTC
|
| [reply] |
Re: RegEx + vs. {1,}
by Anique (Acolyte) on Oct 10, 2012 at 12:26 UTC
|
The problem is not with the (\w{2,}?), because it will happily match to ab in
$x =~ /((\w{2,}?).*\g2.*?)+/;
and
$x =~ /((\w{2,}?).*?\g2.*?)+?/;
You are asking the regex to give you a sequence which is repeated, and you can either get the longest match (greedy) or the shortest one (ungreedy)
The only solution for your problem that I can think of, is to capture all matches that this regex finds (list context), then counting how often each captured sequence occurs, and selecting the one that occurs most often.
| [reply] [d/l] [select] |
Re: RegEx + vs. {1,}
by ELISHEVA (Prior) on Oct 10, 2012 at 14:16 UTC
|
If you want a list of all two letter patterns that appear at least twice somewhere in your string, you need to make three changes to your regex.
- you need to make (\w{2,}) non-greedy by adding a "?" to the end, e.g. (\w{2,}?).
- you need to wrap what comes after (\w{2,}?) with a zero width lookahead group. Otherwise you will miss all the matches between the first and second occurrence of "ab"
- you need to handle repetitions of your regex slightly differently. Instead of /( mumblefoo )+/ you need /mumblefoo/g. Using a + the way you did will only get you the last match found because each time the + causes the regex to repeat, it replaces the previous match.
Taken together these changes will make your regex will look like this: /(\w{2,}?)(?=.*?\1)/g:
print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab", "\n";
print "<" . join('|',$x =~ /(\w{2,}?)(?=.*?\1)/g) , ">\n";
#outputs: <ab|cd|ef|ab|cd|ab>
You can more info on zerolength lookaheads via the Extended Patterns section of the perlre manpage on perldoc | [reply] [d/l] [select] |
|
Or, just remove the comma from the quantifier: \w{2}
| [reply] [d/l] |
|
| [reply] |
|
|
|