RegEx + vs. {1,}

Sewi has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: RegEx + vs. {1,} by grizzley (Chaplain) on Oct 10, 2012 at 12:01 UTC
Because you are trying to match string which occurs twice and 'abcdefg' is first correct candidate. I can think only about such approach for your problem: `$ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if +$x =~ /(\w{2,})(.?\1){2}/;' abcdefgxxabcdefgzzabcdsjfhkdfab abcd $ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if +$x =~ /(\w{2,})(.?\1){3}/;' abcdefgxxabcdefgzzabcdsjfhkdfab ab $ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if +$x =~ /(\w{2,})(.*?\1){4}/;' abcdefgxxabcdefgzzabcdsjfhkdfab` [download] I.e. The problem is you have to say exactly how many occurences you want (there is no 'greediness' in this case == you can't say "I want as many occurences as possible", only lowest possible number of occurences will be chosen).	[reply] [d/l]
Re^2: RegEx + vs. {1,} by Sewi (Friar) on Oct 10, 2012 at 13:01 UTC
Thanks, you can't say "I want as many occurences as possible" seems to be my problem. Try out Padre - the free Perl IDE and visit my blog's Perl section.	[reply]
Re^3: RegEx + vs. {1,} by grizzley (Chaplain) on Oct 10, 2012 at 15:11 UTC
So if that's acceptable for you - use while loop to determine max amount of occurences. There will be no more than length / 2 occurences, so start with this max value and decrease it while trying to match: `$x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; $len=int(length($x)/2); while($x !~ /(\w{2,})(.?\1){$len}/) { $len-- }; $x =~ /(\w{2,})(.?\1){$len}/; # 'strange line' print $1` [download] (to self: do not know why I have to add 'strange line', without it nothing is printed, but $len is correctly set to 4) I tried to generate the list and include it in one regexp: `$ perl -le '$x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; $len=int(length($x +)/2); $restring = join"\|", map {"(?:.?\\1){$_}"} reverse(1..$len); p +rint $restring; print $1 if $x =~ /(\w{2,})($restring)/;' (?:.?\1){15}\|(?:.?\1){14}\|(?:.?\1){13}\|(?:.?\1){12}\|(?:.?\1){11}\| +(?:.?\1){10}\|(?:.?\1){9}\|(?:.?\1){8}\|(?:.?\1){7}\|(?:.?\1){6}\|(?: +.?\1){5}\|(?:.?\1){4}\|(?:.?\1){3}\|(?:.?\1){2}\|(?:.?\1){1} abcdefg` [download] but it does not work as expected (probably some stupid mistake, maybe someone else can tell what's wrong with it).	[reply] [d/l] [select]
Re: RegEx + vs. {1,} by trizen (Hermit) on Oct 10, 2012 at 12:25 UTC
`my $string = 'abcdefgxxabcdefgzzabcdsjfhkdfab'; my %seen; 1 while $string =~ /(\w{2,}?)(?{$seen{$1}++})(?!)/g; print [sort { $seen{$b} <=> $seen{$a} \|\| length($b) <=> length($a) } keys %seen]->[0];` [download] Update: Thanks Athanasius for your appreciation && corrections.	[reply] [d/l]
Re^2: RegEx + vs. {1,} by Athanasius (Archbishop) on Oct 10, 2012 at 16:03 UTC
Thank-you for this! I would have forced backtracking by setting pos: `my %seen; for my $len (2 .. length $string) { ++$seen{$1}, pos($string) -= length($1) - 1 while $string =~ /(\w{ +$len})/g; }` [download] but this is clumsy and verbose beside your elegant one-liner! Took me a while to figure out what `(?!)` was doing. YAPE::Regex::Explain told me: `---------------------------------------------------------------------- (?{$seen{$1}++}) run this block of Perl code ---------------------------------------------------------------------- (?!) fail ----------------------------------------------------------------------` [download] and in Extended Patterns I found: `(FAIL) (F) This pattern matches nothing and always fails. It can be used to force + the engine to backtrack. It is equivalent to (?!), but easier to rea +d. In fact, (?!) gets optimised into (FAIL) internally.` [download] So I’ve learned something that’s really useful. Now a couple of minor quibbles: The non-greedy quantifier in `(\w{2,}?)` is redundant, since the forced backtracking ensures that all substrings will be found whether the search is greedy or not. If, as per your sort, you are looking for only the shortest* substring with the maximum frequency, then you need only search for substrings exactly 2 characters long (the minimum length): `(\w{2})`. (Proof: Say XYZ occurs N times. XY and YZ occur once in each occurrence of XYZ, so each is guaranteed to occur at least N times in the string. Similar reasoning applies to longer substrings.) ++trizen for demonstrating this elegant technique to force regex backtracking! Update: Because `(?!)` forces backtracking via repeatedly-failing matches, neither the `while` loop nor the `/g` modifier are necessary: `$string =~ /(\w{2,})(?{$seen{$1}++})(?!)/;` [download] does the identical job! (as is easily confirmed by dumping the contents of `%seen`). Athanasius <°(((>< contra mundum	[reply] [d/l] [select]
Re: RegEx + vs. {1,} by MidLifeXis (Monsignor) on Oct 10, 2012 at 12:32 UTC
Regexp::Debugger might be helpful to visualize what is happening here. --MidLifeXis	[reply]
Re: RegEx + vs. {1,} by Anique (Acolyte) on Oct 10, 2012 at 12:26 UTC
The problem is not with the `(\w{2,}?)`, because it will happily match to `ab` in `$x =~ /((\w{2,}?).\g2.?)+/;` and `$x =~ /((\w{2,}?).?\g2.?)+?/;` You are asking the regex to give you a sequence which is repeated, and you can either get the longest match (greedy) or the shortest one (ungreedy) The only solution for your problem that I can think of, is to capture all matches that this regex finds (list context), then counting how often each captured sequence occurs, and selecting the one that occurs most often.	[reply] [d/l] [select]
Re: RegEx + vs. {1,} by ELISHEVA (Prior) on Oct 10, 2012 at 14:16 UTC
If you want a list of all two letter patterns that appear at least twice somewhere in your string, you need to make three changes to your regex. you need to make `(\w{2,})` non-greedy by adding a "?" to the end, e.g. `(\w{2,}?)`. you need to wrap what comes after `(\w{2,}?)` with a zero width lookahead group. Otherwise you will miss all the matches between the first and second occurrence of "ab" you need to handle repetitions of your regex slightly differently. Instead of `/( mumblefoo )+/` you need `/mumblefoo/g`. Using a + the way you did will only get you the last match found because each time the + causes the regex to repeat, it replaces the previous match. Taken together these changes will make your regex will look like this: `/(\w{2,}?)(?=.?\1)/g`: `print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab", "\n"; print "<" . join('\|',$x =~ /(\w{2,}?)(?=.?\1)/g) , ">\n"; #outputs: <ab\|cd\|ef\|ab\|cd\|ab>` [download] You can more info on zerolength lookaheads via the Extended Patterns section of the perlre manpage on perldoc	[reply] [d/l] [select]
Re^2: RegEx + vs. {1,} by choroba (Cardinal) on Oct 10, 2012 at 14:25 UTC
Or, just remove the comma from the quantifier: `\w{2}` لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l]
Re^3: RegEx + vs. {1,} by tobyink (Canon) on Oct 10, 2012 at 22:08 UTC
Indeed. There can never be a three character sequence in your string which occurs more frequently than a two character sequence. (Because the three character sequence contains two two character sequences.) `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`	[reply]
Re^4: RegEx + vs. {1,} by grizzley (Chaplain) on Oct 11, 2012 at 07:32 UTC


We don't bite newbies here... much
	PerlMonks