PerlMonks  

Regex a little less greedy please

by martymart (Deacon)
on Mar 18, 2003 at 14:12 UTC ( [id://243977]=perlquestion )

martymart has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks, I have a little app that uses a regular expression. The regex is:
<APPEND.*>
The information I need is the postmatch of this search. The trouble is the greedy quantifier. The expression is searching a string like:
<APPEND changed_date="02-02-2003">This is sample text</APPEND>
I would like the postmatch to give me back:
This is sample text</APPEND>
Instead, I think it's matching up to the '>' at the end of the string. What I need is to tell the regex that the match is complete the first time it encounters a '>'. Is this possible? I would appreciate any ideas you may have on this.
Martymart
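
For reference, a minimal sketch that reproduces the behavior described in the question, using the sample string from the post. The variable names `$matched` and `$post` are made up for illustration and are not from the original app.

```perl
use strict;
use warnings;

my $str = q[<APPEND changed_date="02-02-2003">This is sample text</APPEND>];

# Greedy .* first runs to the end of the string, then backtracks only
# as far as needed to find a '>', which is the final one.
my ($matched, $post) = ('', '');
if ($str =~ /<APPEND.*>/) {
    ($matched, $post) = ($&, $');
}

print "matched:   [$matched]\n";
print "postmatch: [$post]\n";
```

Here the match swallows the whole string, so the postmatch comes back empty, which is consistent with the poster's suspicion.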

Replies are listed 'Best First'.
Re: Regex a little less greedy please
by broquaint (Abbot) on Mar 18, 2003 at 14:19 UTC
    De-greedify that dot-star like so
    my $str = q[<APPEND changed_date="02-02-2003">This is sample text</APPEND>];
    print "match: ", $str =~ m{ <APPEND (.*?) > }x, $/;
    print "post: ", $', $/;
    __output__
    match: changed_date="02-02-2003"
    post: This is sample text</APPEND>
    Check out perlre for more info on perl's regex engine.
    HTH

    _________
    broquaint

Re: Regex a little less greedy please
by arturo (Vicar) on Mar 18, 2003 at 14:29 UTC

    From perlre :

    If you want it to match the minimum 
    number of times possible, follow the 
    quantifier with a "?".  Note that the 
    meanings don’t change, just the 
    "greediness"
    

    So, changing your regex to

    <APPEND.*?>
    should get you the behavior you want. If, however, you take to heart the lessons of Death to Dot Star!, you might want to write that this way:
    <APPEND[^>]*>
    Avoiding using the post-match variable and using () to capture the stuff you want to get is left as an exercise for the reader =)
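
    One possible take on that exercise, as a sketch rather than the only answer: put capturing parens after the tag so the wanted text lands in a variable instead of `$'`. The name `$rest` is made up here, and the `/s` modifier is an added assumption in case the text spans several lines.

```perl
use strict;
use warnings;

my $str = q[<APPEND changed_date="02-02-2003">This is sample text</APPEND>];

# [^>]* cannot run past the first '>', so the opening tag ends where expected.
# (.*) then captures everything after the tag, so $' is not needed.
# /s lets '.' match newlines as well.
my ($rest) = $str =~ /<APPEND[^>]*>(.*)/s;

print defined $rest ? "rest: $rest\n" : "no match\n";
```

    In list context the match returns its captures, so `$rest` is undefined when the pattern does not match, which gives a simple way to check for failure.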

    HTH

    If not P, what? Q maybe?
    "Sidney Morgenbesser"

Re: Regex a little less greedy please
by MZSanford (Curate) on Mar 18, 2003 at 14:16 UTC
    Parsing HTML/XML is somewhat tricky to do correctly (what with entities and all ... see Super Search for more info), but if you know that there will not be any >'s in the tag, you may want to use a regexp like ...
    m/<[^>]+>/
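
    A small sketch of what that pattern picks up on the sample string from the question, assuming no '>' inside the tags. The `@tags` array and the added capturing parens are illustrative only.

```perl
use strict;
use warnings;

my $str = q[<APPEND changed_date="02-02-2003">This is sample text</APPEND>];

# With /g in list context, every non-overlapping match of the
# parenthesized pattern is returned, one element per tag found.
my @tags = $str =~ /(<[^>]+>)/g;

print "$_\n" for @tags;
```

    Note that the pattern matches any tag, opening or closing, so here it finds both the `<APPEND ...>` tag and `</APPEND>`.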

    from the frivolous to the serious
Re: Regex a little less greedy please
by roundboy (Sexton) on Mar 18, 2003 at 19:01 UTC
    In addition to using either the non-greedy quantifier (.*?) or skipping up to the next > ([^>]*), you also want to capture the text up through the matching end-tag, for which you just need a non-greedy quantifier inside capturing parens. So your regex should look like
    m{<APPEND\b[^>]*>(.*?)</APPEND>}

    This puts the text between the tags into $1; if you really want the ending tag, too, just move the paren. I added the \b to make sure you only match <APPEND> tags, and not, e.g., <APPENDIX>. The only caveats on this are:

    1. You might want to add a /i modifier to the match, in case someone adds the tags in lower case.
    2. If there's ever a chance of a '>' appearing in the attributes of the tag, you need something more complicated. The following (untested, but based on Friedl's Mastering Regular Expressions) should work:
      m{<APPEND\b(?:"[^"]*"|'[^']*'|[^'">])*>(.*?)</APPEND>}
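
    A quick sketch that exercises the longer pattern together with both caveats. The second and third sample strings are hypothetical inputs made up for this check, and adding `/s` (so the text may span lines) is an assumption beyond the original post.

```perl
use strict;
use warnings;

# The first sample is from the question; the other two are invented.
my @samples = (
    q[<APPEND changed_date="02-02-2003">This is sample text</APPEND>],
    q[<APPENDIX>not an APPEND tag</APPENDIX>],
    q[<append note="a > b">lower-case tag</append>],
);

# Quoted attribute values are consumed whole, so a '>' inside quotes
# does not end the tag. /i accepts lower-case tags; /s lets the
# captured text span lines.
my $re = qr{<APPEND\b(?:"[^"]*"|'[^']*'|[^'">])*>(.*?)</APPEND>}is;

my @results;
for my $s (@samples) {
    # Store the captured text, or undef when the sample does not match.
    push @results, $s =~ $re ? $1 : undef;
}

for my $i (0 .. $#samples) {
    my $shown = defined $results[$i] ? $results[$i] : '(no match)';
    print "$samples[$i]\n  => $shown\n";
}
```

    On these inputs the `\b` keeps `<APPENDIX>` from matching, and the quoted `>` in the third sample does not cut the tag short.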

    HTH,
    --roundboy
