regex capturing problem

pike has asked for the wisdom of the Perl Monks concerning the following question:

I thought I understood regexes, but here is one I just can't figure out.

I wanted to break words at single quotes, so my idea was to use

@words = $word =~ /^(\w+')([\w-]+)$/

Works fine. Then I discovered that sometimes there are words with more than one ' in them, so I changed it to

@words = $word =~ /^(\w+')+([\w-]+)$/

(I inserted a + after the first group.) I'd expect this to break e.g. "d'aujourd'hui" into "d':aujourd':hui", but it gives only the last two parts: "aujourd':hui".

Why?!?

pike

P.S. I know I could use split to get what I want:

@words = split /(?<=')(?!s$)/, $word

but I'd just like to know what's wrong with my regex...
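For reference, a minimal sketch of what the split from the P.S. does. The helper name break_word and the sample word "John's" are assumptions made for illustration, not part of the original post. The lookbehind splits after every ', and the (?!s$) lookahead leaves a trailing 's attached.

```perl
use strict;
use warnings;

# Hypothetical helper: split after each apostrophe,
# except where only a final "s" follows (e.g. a possessive).
sub break_word {
    my ($word) = @_;
    return split /(?<=')(?!s$)/, $word;
}

my @french  = break_word("d'aujourd'hui");   # three parts
my @english = break_word("John's");          # left in one piece

print join(':', @french),  "\n";   # d':aujourd':hui
print join(':', @english), "\n";   # John's
```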


Replies are listed 'Best First'.
Re: regex capturing problem by erikharrison (Deacon) on Mar 21, 2002 at 13:33 UTC
You can't use + on parens like that to "remember" any number of matches - that's one of the reasons to have `split`. As for why it only gives the last part: matches are greedy (unless you tell them not to be) and will get the largest match they can - here, the end of your string.

Cheers,
Erik
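A quick sketch of the point about + on parens, using the word from the question. A list-context match returns one value per pair of capturing parens, so the pattern below can hand back at most two values, however often the first group repeats.

```perl
use strict;
use warnings;

my $word = "d'aujourd'hui";

# Two pairs of capturing parens, so two values come back;
# the repeated first group keeps only its last iteration.
my @words = $word =~ /^(\w+')+([\w-]+)$/;

print scalar(@words), "\n";      # 2
print join(':', @words), "\n";   # aujourd':hui
```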
Re: Re: regex capturing problem by demerphq (Chancellor) on Mar 21, 2002 at 15:16 UTC
Although I suspect you know, for anybody else reading this, a slight clarification:

"matches are greedy (unless you tell them not to be) and will get the largest match they can"

This is not strictly true with Perl's NFA-based regex engine. It finds the leftmost match and, among alternatives, takes the first one that succeeds. This doesn't mean the longest possible match, which a DFA-based regex engine (egrep) would provide. Thus

"AAABBBBBBBBAAAAAAAAAAAA" =~ m/A+|A+B+A+/;

will match "AAA" and not the entire string, whereas a DFA-based regex engine would match the entire string. OTOH, reversing the alternatives

"AAABBBBBBBBAAAAAAAAAAAA" =~ m/A+B+A+|A+/;

would match the entire string using either engine.

Yves / DeMerphq
---
Writing a good benchmark isn't as easy as it might look.
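The two alternations above can be checked directly in Perl. This is only a sketch: the variable names are made up, and the capturing parens are added here so that the matched text can be inspected.

```perl
use strict;
use warnings;

my $string = "AAABBBBBBBBAAAAAAAAAAAA";

# Same alternatives, different order; Perl tries them left to right.
my ($short_first) = $string =~ /(A+|A+B+A+)/;
my ($long_first)  = $string =~ /(A+B+A+|A+)/;

my $verdict = $long_first eq $string ? "whole string" : "partial";

print "$short_first\n";   # AAA
print "$verdict\n";       # whole string
```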
Re: regex capturing problem by larryk (Friar) on Mar 21, 2002 at 13:43 UTC
because your RE is looking for a "word" (alphanum chars) ending in a ' followed by a word (with possible hyphens) immediately followed by the end of line. this is not the case for "d'aujourd" as there is no end of line. the RE skips this and moves on to "aujourd'hui", which is followed by end of line - ergo a match. what I believe you want to do is match all word parts followed by a ' or the end of line, like so:

@words = $word =~ /(\w+)['\z]?/g;

hth
larryk

perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
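A small sketch of the "word part followed by a ' or the end of the string" idea. It is written with an alternation instead of a character class, because \z is not an end-of-string anchor inside [...]. Treat it as one possible spelling, not as the poster's code; note that it drops the apostrophes and does not handle hyphens.

```perl
use strict;
use warnings;

my $word = "d'aujourd'hui";

# Each captured word part must be followed by an apostrophe
# or by the end of the string.
my @words = $word =~ /(\w+)(?:'|\z)/g;

print join(':', @words), "\n";   # d:aujourd:hui
```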
Re: Re: regex capturing problem by pike (Monk) on Mar 21, 2002 at 14:12 UTC
Doesn't really explain why I get only the last two parts, as my regex started with '^' and therefore can't just start matching in the middle of the string... But the correct answer has already been given (thanks, erikharrison). I just wasn't aware that ()+ gives you just one item, no matter how often it matched.

pike
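Since ()+ yields a single item, one way to keep the anchored check and still get every part is to do it in two steps: validate the overall shape first, then collect the pieces with /g. A minimal sketch, assuming that behaviour is what is wanted; break_at_quotes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the parts, or an empty list
# if the word does not have the expected shape.
sub break_at_quotes {
    my ($word) = @_;

    # Step 1: the whole word must be one or more "chars'" groups
    # followed by a final part (hyphens allowed).
    return unless $word =~ /^(?:\w+')+[\w-]+$/;

    # Step 2: with /g in list context, every match contributes a capture.
    return $word =~ /(\w+'|[\w-]+$)/g;
}

my @words = break_at_quotes("d'aujourd'hui");
my @none  = break_at_quotes("hui");      # no apostrophe: empty list

print join(':', @words), "\n";   # d':aujourd':hui
print scalar(@none), "\n";       # 0
```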
Re: Re: Re: regex capturing problem by larryk (Friar) on Mar 21, 2002 at 16:23 UTC
The point is not whether it can start matching in the middle of the string but whether it can start _capturing_ in the middle of the string. By adding a + to the first part you allow the RE engine to effectively ignore the ^ anchor while still forcing it to make a match at the end of the string.

larryk

perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
Re: Re: Re: regex capturing problem by petral (Curate) on Mar 21, 2002 at 21:11 UTC
Actually, it matches all of them, but since it is repeating, it overwrites the first match:

$ perl -lwe"q<d'aujourd'hui> =~ /^((\w+')(?{print$+}))+([\w-]+)$/"
d'
aujourd'
$

p
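The same embedded-code trick can be used to keep each iteration instead of printing it. This is only a sketch: @parts is a name made up here, $^N is the text of the most recently closed group, and perlre has long described (?{ ... }) as experimental. Values pushed this way are not removed if the engine later backtracks, so this relies on the simple shape of the pattern.

```perl
use strict;
use warnings;

our @parts;   # package variable, filled from inside the pattern

my $word = "d'aujourd'hui";

# Each time the repeated group closes, remember what it just matched.
if ($word =~ /^(?:(\w+')(?{ push @parts, $^N }))+([\w-]+)$/) {
    push @parts, $2;   # the final part after the last apostrophe
}

print join(':', @parts), "\n";   # d':aujourd':hui
```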
Re: regex capturing problem by AidanLee (Chaplain) on Mar 21, 2002 at 13:51 UTC
I suppose that if you really wanted to use a non-split solution, you could try something like this:

#in scalar context:
my @words = ();
push @words, $1 while $word =~ /(\w+'?)/g;

#in list context:
my @words = ( $word =~ /(\w+'?)/g );
Re: regex capturing problem by stephane (Monk) on Mar 21, 2002 at 13:41 UTC
Check it again: your first idea doesn't work either for d'aujourd'hui (but does for aujourd'hui, unless I am myself making some typos ;-) The problem there is that your regexp is not.. what should I say.. "recursive": /^(\w+')([\w-]+)$/ can only return two matches (i.e. $1 and $2, and not $1, $2, $3 to $n, etc.)
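This is easy to confirm with the first pattern from the question. A short sketch; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $re = qr/^(\w+')([\w-]+)$/;

# One apostrophe: both groups match.
my @one_quote  = "aujourd'hui"   =~ $re;

# Two apostrophes: \w cannot cross the second ', so the match fails
# and the list is empty.
my @two_quotes = "d'aujourd'hui" =~ $re;

print join(':', @one_quote), "\n";   # aujourd':hui
print scalar(@two_quotes), "\n";     # 0
```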