PerlMonks  

minimal greed, revisited

by gregor-e (Beadle)
on Feb 15, 2000 at 23:54 UTC ( [id://3540]=perlquestion )

gregor-e has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to extract a record from inside a multi-line scalar using a single regex, for example:
#! /usr/local/bin/perl
my $source = "
    name: Schwank E. Schwagg
    address: 123 Dirtweed Dr.
    user: 1
    name: Bizzi Buddi
    address: 321 Grapevine Way
    user: 2
    name: Fernal Brimstone
    address: 666 Lucifer Ln
    user: 3
";
$source =~ m/.*?(name.*?user: 2)/s;
print "$1\n";
I'd like this example to extract the name/address record for user #2, Bizzi Buddi, so I attempt to grab everything from "name" to "user: 2". Unfortunately, this regex is not minimally greedy, despite the .*?, and ends up matching the preceding record as well, since that record also contains "name". How can I create a regex that minimally matches /literal_1.*literal_2/ and guarantees that the first literal is not repeated within the wildcard?

Replies are listed 'Best First'.
Re: minimal greed
by neshura (Chaplain) on Feb 16, 2000 at 01:12 UTC
    i am not entirely sure, but a lookbehind requires fixed width. how about instead matching the generic name:user and testing for a user number? this is almost definitely not the most elegant hack in terms of a minimalist expression or efficiency, but i think it works.
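    while ($source =~ m/(name.*?)(?=name|\z)/sg) {    # |\z so the last record is tested too
        my $record = $1;    # copy before the inner match overwrites $1
        if ($record =~ m/(.*user: 2)/s) {
            print "$1\n";
        }
    }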


Re: minimal greed
by chromatic (Archbishop) on Feb 16, 2000 at 03:01 UTC
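    I think you have a misconception as to greediness. The underlying regex engine starts at the left side of a string and tries to match from there; only if that attempt fails does it move rightward, (character|unit|position) by $1. So the leftmost starting point that can match always wins, and .*? only controls how little is consumed once that starting point is fixed. There is a negative lookahead assertion, (?!...), though. See perlre for details. It's still probably not what you want.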
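
A minimal sketch of the point above, using a made-up one-line string rather than the original data. It suggests that the non-greedy quantifier does not change where the match starts, only how far it runs. The second pattern is one possible workaround, not a recommendation: a greedy .* in front pushes the start of the capture as far right as it can go.

```perl
use strict;
use warnings;

# Hypothetical sample data, shortened to one line for illustration.
my $text = 'name: A user: 1 name: B user: 2';

# The engine tries the leftmost start first. Starting at the first "name",
# .*? can stretch far enough to reach "user: 2", so that start wins.
my ($leftmost) = $text =~ /(name.*?user: 2)/s;

# A greedy .* in front consumes as much as it can and backs off only as
# needed, so the capture begins at the last "name" that still works.
my ($rightmost) = $text =~ /.*(name.*?user: 2)/s;

print "leftmost:  $leftmost\n";     # name: A user: 1 name: B user: 2
print "rightmost: $rightmost\n";    # name: B user: 2
```

Two caveats with the workaround: "user: 2" would also match the front of "user: 20", and if several records end in "user: 2" only the last one is captured.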

    When you're dealing with data in a repeated format like this, something like a split might be more appropriate:

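    my @fields = split(/\w+: /, $source);
    # $fields[0] is whatever precedes the first label, so record 2 is fields 4 .. 6
    # (note that split throws the "name: " etc. labels away)
    my $field2 = join '', @fields[4 .. 6];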
    That's not pretty, but it's more likely to get you to your solution way before you can craft a regex that'll do what you mean.
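
A sketch of a related approach, assuming every record has exactly the three labels name, address and user in that order (true of the sample data, but an assumption in general): parse each record into a hash, then pick the one you want by user number. The variable names here are illustrative only.

```perl
use strict;
use warnings;

my $source = join "\n",
    'name: Schwank E. Schwagg', 'address: 123 Dirtweed Dr.',  'user: 1',
    'name: Bizzi Buddi',        'address: 321 Grapevine Way', 'user: 2',
    'name: Fernal Brimstone',   'address: 666 Lucifer Ln',    'user: 3';

# Collect one hash per record; each .*? stops at the next label.
my @records;
while ($source =~ /name:\s*(.*?)\s*address:\s*(.*?)\s*user:\s*(\d+)/sg) {
    push @records, { name => $1, address => $2, user => $3 };
}

# Select by the parsed user number rather than by regex position.
my ($wanted) = grep { $_->{user} == 2 } @records;
print "$wanted->{name}, $wanted->{address}\n" if $wanted;
```

Because the user number is compared numerically, "user: 20" cannot be mistaken for "user: 2".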
Re: minimal greed
by japhy (Canon) on Feb 16, 2000 at 19:44 UTC
    From what I've found in benchmarks, a positive look-ahead assertion:
    $match =~ /($start .*?) (?=$start)/xs;
    is just a HAIR slower than a negative look-ahead assertion:
    $match =~ /($start (?: (?! $start ) .)*)/xs;
    The first one matches as little from $start as it can until it's followed by $start again. That would actually FAIL for the last record. I'd suggest changing it to
    $match =~ /($start .*?) (?= $start | \Z)/xs;
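    Where \Z is the end-of-string anchor (it matches at the very end of the string, or just before a newline at the very end; \z matches only at the absolute end). The second one matches from $start as much as possible that isn't $start again.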
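
Applied to the original question, the negative look-ahead form might look like the sketch below. The $end variable, the non-greedy *? and the trailing \b are additions for this illustration, not part of the patterns benchmarked above.

```perl
use strict;
use warnings;

my $source = join "\n",
    'name: Schwank E. Schwagg', 'address: 123 Dirtweed Dr.',  'user: 1',
    'name: Bizzi Buddi',        'address: 321 Grapevine Way', 'user: 2',
    'name: Fernal Brimstone',   'address: 666 Lucifer Ln',    'user: 3';

my $start = 'name';
my $end   = 'user: 2';

# Each . may only be consumed where $start does not begin, so the match
# cannot run across the start of another record. \Q...\E quotes the
# space in $end, which /x would otherwise ignore.
my ($record) = $source =~ /($start (?: (?! $start ) . )*? \Q$end\E \b)/xs;

print "$record\n" if defined $record;
```

The attempt that starts at the first "name" fails as soon as it reaches the second "name", so the engine moves on and the match begins at Bizzi Buddi's record.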

    And I'd like to set the record straight on /s and /m (even though perlre does a good job). The /s modifier ONLY means that the . regex character matches EVERY character (including \n, which it usually doesn't). The /m modifier means that ^ matches the real beginning of a string, or immediately after a \n, and that $ matches immediately before a newline, or at the real end of a string. Normally, the ^ anchor only matches at the REAL beginning of a string, and the $ anchor only matches the REAL end of a string, or immediately before a newline at the REAL end of a string. With or without /m, the \A anchor matches only the REAL beginning of the string, the \Z anchor matches the REAL end of the string or immediately before a newline at the REAL end, and the \z anchor matches only the REAL end.
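
A small sketch that exercises those rules on a made-up two-line string. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

my $dot_plain = $text =~ /one.two/  ? 1 : 0;   # 0: . does not match \n
my $dot_s     = $text =~ /one.two/s ? 1 : 0;   # 1: /s lets . match \n

my @starts    = $text =~ /^(\w+)/g;            # ('one')
my @starts_m  = $text =~ /^(\w+)/mg;           # ('one', 'two')

my $eol       = $text =~ /one$/     ? 1 : 0;   # 0: "one" is not at the end
my $eol_m     = $text =~ /one$/m    ? 1 : 0;   # 1: /m lets $ match before any \n

my $big_z     = $text =~ /two\Z/    ? 1 : 0;   # 1: \Z allows a final newline
my $small_z   = $text =~ /two\z/    ? 1 : 0;   # 0: \z is the absolute end

print "dot: $dot_plain/$dot_s  starts: @starts | @starts_m\n";
print "eol: $eol/$eol_m  \\Z: $big_z  \\z: $small_z\n";
```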
Re: minimal greed
by Crulx (Monk) on Feb 16, 2000 at 14:48 UTC
    If reordering the records is within your power, put the user number at the top of the record. Then you could simply do
    $source =~ m/(user: 2\s*name: .*?\s+address: .*?\n)/s;
    Or something like that... I'm way too tired to get it right. *grin*
    ---
    Crulx
    crulx@iaxs.net
Re: minimal greed
by jcouto (Novice) on Feb 16, 2000 at 15:27 UTC

    chromatic is 100% right about the greediness part.

    I did some testing (I can't say I've really mastered regexes in Perl, I just hammer them till they work :-) and this does what you want:
    @s = split /^\s+user: \d+\n/sm, $source;
    Translated to English, that is "split $source at each line starting with a number of spaces, followed by user: and a number and a newline", using the m modifier so that ^ matches at the beginning of each line (the s modifier is harmless here but has no effect, since the pattern contains no .). Each element of @s is then one record, minus its user: line, so the record for user 2 is $s[1].
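
A variation on that split, as a sketch only: putting the number in capturing parentheses makes split return it along with each record, so records can be looked up by user number instead of by position. It assumes the record lines are indented, as the \s+ in the pattern requires; %record_for is a hypothetical name.

```perl
use strict;
use warnings;

# Sample data with indented lines (an assumption about the original layout).
my $source = join("\n    ", '',
    'name: Schwank E. Schwagg', 'address: 123 Dirtweed Dr.',  'user: 1',
    'name: Bizzi Buddi',        'address: 321 Grapevine Way', 'user: 2',
    'name: Fernal Brimstone',   'address: 666 Lucifer Ln',    'user: 3',
) . "\n";

# With a capture group, split returns: record, number, record, number, ...
my @parts = split /^\s+user: (\d+)\n/m, $source;

my %record_for;
while (my ($record, $user) = splice @parts, 0, 2) {
    $record =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
    $record_for{$user} = $record;
}

print "$record_for{2}\n";
```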

    Hope it helps.
