PerlMonks  

regex capturing problem

by pike (Monk)
on Mar 21, 2002 at 13:17 UTC

pike has asked for the wisdom of the Perl Monks concerning the following question:

I thought I understood regexes, but here is one I just can't figure out.

I wanted to break words at single quotes, so my idea was to use

@words = $word =~ /^(\w+')([\w-]+)$/

Works fine. Then I discovered that sometimes there are words with more than one ' in them, so I changed it to

@words = $word =~ /^(\w+')+([\w-]+)$/

inserted a + after the first group. I'd expect this to break e.g. "d'aujourd'hui" into "d':aujourd':hui", but what it does is it gives only the last two parts: "aujourd':hui".

Why?!?

pike

P.S. I know I could use split to get what I want:

@words = split /(?<=')(?!s$)/, $word

but I'd just like to know what's wrong with my regex...
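For anyone who wants to try the split from the P.S., here is a minimal sketch. The sub name split_at_quotes is made up for illustration; the pattern is the one quoted above. The lookbehind splits right after each apostrophe, and the negative lookahead leaves a trailing 's alone.

```perl
use strict;
use warnings;

# Hypothetical helper: split after every apostrophe,
# except one that is followed by a final "s".
sub split_at_quotes {
    my ($word) = @_;
    return split /(?<=')(?!s$)/, $word;
}

print join(':', split_at_quotes("d'aujourd'hui")), "\n";   # d':aujourd':hui
print join(':', split_at_quotes("it's")), "\n";            # it's
```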

Replies are listed 'Best First'.
Re: regex capturing problem
by erikharrison (Deacon) on Mar 21, 2002 at 13:33 UTC

    You can't use + on parens like that to "remember" any number of matches; that's one of the reasons to have split. As for why it only gives the last part, the reason is that matches are greedy (unless you tell them not to be) and will get the largest match they can - here the end of your string.
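A small sketch of that behaviour, using the regex from the question. The sub name quantified_captures is hypothetical. The quantified group matches twice here, but it is still a single group, so only what it held on its last repetition comes back.

```perl
use strict;
use warnings;

# Hypothetical helper: return whatever the capture groups hold
# after a successful match (empty list if the match fails).
sub quantified_captures {
    my ($word) = @_;
    return $word =~ /^(\w+')+([\w-]+)$/;
}

# Group 1 matched "d'" and then "aujourd'"; only the last one is kept.
print join(':', quantified_captures("d'aujourd'hui")), "\n";   # aujourd':hui
```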

    Cheers,
    Erik
      Although I suspect you know, for anybody else reading this a slight clarification:

      is that matches are greedy (unless you tell them not to be) and will get the largest match they can

      This is not strictly true with Perl's NFA-based regex engine. It takes the leftmost match, and at that position the first alternative that succeeds. This doesn't mean the longest possible match, which is what a DFA-based regex engine (egrep) would provide. Thus

      "AAABBBBBBBBAAAAAAAAAAAA"=~m/A+|A+B+A+/;
      Will match "AAA" and not the entire string. But a DFA based regex engine would match the entire string.

      OTOH reversing the option

      "AAABBBBBBBBAAAAAAAAAAAA"=~m/A+B+A+|A+/;
      Would match the entire string using either engine.
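The two cases above can be checked with a short script. This is only a sketch; matched_text is a hypothetical helper that wraps the pattern in a capture group so the matched text can be returned.

```perl
use strict;
use warnings;

my $text = "AAABBBBBBBBAAAAAAAAAAAA";

# Hypothetical helper: return the text the pattern matched, or undef.
sub matched_text {
    my ($string, $regex) = @_;
    return $string =~ /($regex)/ ? $1 : undef;
}

# Perl tries the alternatives left to right and keeps the first that works.
print matched_text($text, qr/A+|A+B+A+/), "\n";   # AAA
print matched_text($text, qr/A+B+A+|A+/), "\n";   # AAABBBBBBBBAAAAAAAAAAAA
```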

      Yves / DeMerphq
      ---
      Writing a good benchmark isn't as easy it might look.

Re: regex capturing problem
by larryk (Friar) on Mar 21, 2002 at 13:43 UTC
    because your RE is looking for a "word" (alphanum chars) ending in an ' followed by a word (with possible hyphens) immediately followed by the end of line. this is not the case for "d'aujourd" as there is no end of line. the RE skips this and moves on to "aujourd'hui" which is followed by end of line - ergo a match.

    what I believe you want to do is match all wordparts followed by an ' or the end of line like so:

    @words = $word =~ /(\w+)(?:'|\z)/g;
    hth
       larryk                                          
    perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
    
      Doesn't really explain why I get only the last two parts as my regex started with '^' and therefore can't just start matching in the middle of the string...

      But the correct answer has already been given (thanks, erikharrison). I just wasn't aware that ()+ gives you just one item, no matter how often it matched.

      pike

        The point is not whether it can start matching in the middle of the string but whether it can start _capturing_ in the middle of the string. By adding a + to the first part you allow the RE engine to effectively ignore the ^ anchor while still forcing it to make a match at the end of the string.
           larryk                                          
        perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
        
        Actually, it matches all of them, but since it is repeating, it overwrites the first match:
        $ perl -lwe"q<d'aujourd'hui> =~ /^((\w+')(?{print$+}))+([\w-]+)$/"
        d'
        aujourd'
        $
            p
Re: regex capturing problem
by AidanLee (Chaplain) on Mar 21, 2002 at 13:51 UTC

    I suppose that if you really wanted to use a non-split solution, you could try something like this:

    # in scalar context:
    my @words = ();
    push @words, $1 while $word =~ /(\w+'?)/g;

    # in list context:
    my @words = ( $word =~ /(\w+'?)/g );
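A quick sketch of the list-context form run on sample words. The sub names and the hyphen-aware variant with [\w-] are my own additions for illustration, borrowed from the character class in the original question; they are not part of the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every word part with its trailing apostrophe.
sub word_parts {
    my ($word) = @_;
    return $word =~ /(\w+'?)/g;
}

# Assumed variant: also keeps hyphenated parts together.
sub word_parts_hyphen {
    my ($word) = @_;
    return $word =~ /([\w-]+'?)/g;
}

print join(':', word_parts("d'aujourd'hui")), "\n";          # d':aujourd':hui
print join(':', word_parts("l'arc-en-ciel")), "\n";          # l':arc:en:ciel
print join(':', word_parts_hyphen("l'arc-en-ciel")), "\n";   # l':arc-en-ciel
```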
Re: regex capturing problem
by stephane (Monk) on Mar 21, 2002 at 13:41 UTC

    Check it again: your first idea doesn't work either for d'aujourd'hui (but does for aujourd'hui, unless I am making some typos myself ;-)

    The problem there is that your regexp is not.. what should I say.. "recursive": /^(\w+')([\w-]+)$/ can only return two matches (i.e. $1 and $2, not $1, $2, $3 up to $n)
