minimal greed, revisited

gregor-e has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to extract a record from inside a multi-line scalar using a single regex, for example:

#! /usr/local/bin/perl

my $source = "
name: Schwank E. Schwagg
address: 123 Dirtweed Dr.
user: 1
name: Bizzi Buddi
address: 321 Grapevine Way
user: 2
name: Fernal Brimstone
address: 666 Lucifer Ln
user: 3
";

$source =~ m/.*?(name.*?user: 2)/s;
print "$1\n";

I'd like this example to extract the name/address record for user #2, Bizzi Buddi, so I attempt to grab everything from "name" to "user: 2". Unfortunately, this regex is not minimally greedy, despite the .*?, and ends up also matching the preceding record, which contains "name". How can I create a regex that minimally matches /literal_1.*literal_2/ and guarantees that the first literal is not repeated within the wildcard?


Replies are listed 'Best First'.
Re: minimal greed by neshura (Chaplain) on Feb 16, 2000 at 01:12 UTC
I am not entirely sure, but a lookbehind requires fixed width. How about instead matching each generic name-to-user record and testing it for a user number? This is almost definitely not the most elegant hack in terms of a minimalist expression or efficiency, but I think it works. `while ($source =~ m/(name.*?)(?=name)/sg) { if ($& =~ m/(.*user: 2)/s) { print "$1\n"; } }`
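A runnable sketch of the record-by-record idea above, under two assumptions that are not in the post: the lookahead also accepts the end of the string (`\z`) so the last record gets visited, and the user test is anchored to a whole line so that `user: 2` does not also hit `user: 20`. The array name `@found` is only illustrative.

```perl
use strict;
use warnings;

my $source = <<'END';
name: Schwank E. Schwagg
address: 123 Dirtweed Dr.
user: 1
name: Bizzi Buddi
address: 321 Grapevine Way
user: 2
name: Fernal Brimstone
address: 666 Lucifer Ln
user: 3
END

my @found;

# Walk the string one record at a time. Each record runs from "name:" up to
# (but not including) the next "name:", or to the end of the string.
while ($source =~ m/(name:.*?)(?=name:|\z)/sg) {
    my $record = $1;

    # Keep only the record whose user line is exactly "user: 2".
    push @found, $record if $record =~ m/^user: 2$/m;
}

print @found;
```

Copying `$1` into a lexical before running the inner match avoids relying on `$&`, which the inner match would overwrite.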
Re: minimal greed by chromatic (Archbishop) on Feb 16, 2000 at 03:01 UTC
I think you have a misconception as to greediness. The underlying regex engine starts at the left side of a string, matches as far as it can, and then moves rightward, (character|unit|position) by $1. In other words, the first starting position that can produce a match wins, and .*? only minimizes where the match ends, not where it begins. There is a negative lookahead assertion, though. See perlre for details. It's still probably not what you want. When you're dealing with data in a repeated format like this, something like a split might be more appropriate: `my @fields = split(/\w+: /, $source); my $field2 = join '', @fields[4 .. 6];` That's not pretty, but it's more likely to get you to your solution way before you can craft a regex that'll do what you mean.
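To make the leftmost-match point concrete, here is a small sketch on a made-up one-line string (the string and the variable names `$lazy` and `$guarded` are assumptions for illustration only). The lazy quantifier still starts at the first "name"; guarding each wildcard character with a negative lookahead forces the first attempt to fail, so the engine moves right to the next "name".

```perl
use strict;
use warnings;

my $text = "name A user: 1 name B user: 2";

# Lazy quantifier: the match still STARTS at the first "name", because the
# engine tries start positions left to right and keeps the first that works.
my ($lazy) = $text =~ /(name.*?user: 2)/;

# Negative lookahead before every wildcard character: the wildcard may not
# cross another "name", so the first start position fails and the engine
# retries further right.
my ($guarded) = $text =~ /(name(?:(?!name).)*user: 2)/;

print "lazy:    $lazy\n";
print "guarded: $guarded\n";
```

The first capture spans both records; the second holds only `name B user: 2`.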
Re: minimal greed by japhy (Canon) on Feb 16, 2000 at 19:44 UTC
From what I've found in benchmarks, a positive look-ahead assertion: `$match =~ /($start .*?) (?=$start)/xs;` is just a HAIR slower than a negative look-ahead assertion: `$match =~ /($start (?: (?! $start ) .)*)/xs;` The first one matches as little from $start as it can until it's followed by $start again. That would actually FAIL for the last record. I'd suggest changing it to `$match =~ /($start .*?) (?= $start | \Z)/xs;` where \Z is the end-of-string anchor (it also matches just before a final newline; \z is the absolute end of the string). The second one matches from $start as much as possible that isn't $start again.
And I'd like to set the record straight on /s and /m (even though perlre does a good job). The /s modifier ONLY means that the . regex character matches EVERY character (including \n, which it usually doesn't). The /m modifier means that ^ matches at the real beginning of a string, or immediately after a \n, and that $ matches immediately before a newline, or at the real end of a string. Normally, the ^ anchor only matches at the REAL beginning of a string, and the $ anchor only matches at the REAL end of a string, or immediately before a newline at the REAL end of a string. Even when using /m, the \A anchor matches only at the REAL beginning of the string, and the \Z anchor only at the REAL end of the string (or just before a final newline there); \z matches only at the very end.
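Applied to the original question, the negative look-ahead form might look like the sketch below. The names `$start` and `$end` and the end marker itself are assumptions added here (the patterns above only use `$start`), and the quantifier is made lazy so it stops at the first matching user line.

```perl
use strict;
use warnings;

my $source = <<'END';
name: Schwank E. Schwagg
address: 123 Dirtweed Dr.
user: 1
name: Bizzi Buddi
address: 321 Grapevine Way
user: 2
name: Fernal Brimstone
address: 666 Lucifer Ln
user: 3
END

# Hypothetical markers. Compiling them with qr keeps the literal space
# significant even when they are interpolated into a pattern that uses x.
my $start = qr/name:/;
my $end   = qr/user: 2\b/;

my ($record) = $source =~ /
    ( $start                   # a record start
      (?: (?! $start ) . )*?   # then characters that do not begin another record
      $end                     # up to the wanted user line
    )
/xs;

print "$record\n";
```

Because the wildcard can never step over another `name:`, the attempt that starts at the first record fails outright and the match begins at Bizzi Buddi's record.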
Re: minimal greed by Crulx (Monk) on Feb 16, 2000 at 14:48 UTC
If reordering the records is within your power, put the user number at the top of the record. Then you could simply do `$source =~ m/.*(user: 2\s*name: .* address: .*?\n)/s;` Or something like that... I'm way too tired to get it right. *grin* --- Crulx crulx@iaxs.net
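For what it's worth, here is one way the reordered layout could be matched. The data below is the original sample rearranged by hand, which is an assumption about what "user number at the top" would look like, and `[^\n]*` is used instead of `.*` so each field stays on its own line.

```perl
use strict;
use warnings;

# Same sample data, with the user line moved to the top of each record.
my $source = <<'END';
user: 1
name: Schwank E. Schwagg
address: 123 Dirtweed Dr.
user: 2
name: Bizzi Buddi
address: 321 Grapevine Way
user: 3
name: Fernal Brimstone
address: 666 Lucifer Ln
END

# With the key first, the pattern can anchor on it and read exactly two
# more lines, so nothing can spill over from a neighbouring record.
my ($record) = $source =~ /^(user: 2\nname: [^\n]*\naddress: [^\n]*\n)/m;

print $record;
```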
Re: minimal greed by jcouto (Novice) on Feb 16, 2000 at 15:27 UTC
chromatic is 100% right about the greediness part. I did some testing (I can't say I really master regex in Perl, I just hammer them till they work :-) and this does what you want: `@s=split /^\s+user: \d+\n/sm, $source;` Translated to English, that is "split $source at each line starting with a number of spaces, followed by user: and a number and a newline", using the s and m modifiers to read $source as a multiline string while matching the ^ at the beginning of each line. Hope it helps.
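A variation on the split idea, sketched under the assumption that every record begins with a line starting `name:` at the left margin. It splits just before each such line instead of on the user line, so the labels and the user number stay inside each record; `@records` and `$wanted` are illustrative names.

```perl
use strict;
use warnings;

my $source = <<'END';
name: Schwank E. Schwagg
address: 123 Dirtweed Dr.
user: 1
name: Bizzi Buddi
address: 321 Grapevine Way
user: 2
name: Fernal Brimstone
address: 666 Lucifer Ln
user: 3
END

# Split just before every line that begins with "name:". The lookahead is
# zero-width, so "name:" stays attached to the record that follows it.
# The grep drops any blank leading chunk.
my @records = grep { /\S/ } split /^(?=name:)/m, $source;

# Pick the record whose user line is exactly "user: 2".
my ($wanted) = grep { /^user: 2$/m } @records;

print $wanted;
```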

