PerlMonks  

Rewrite for "How to get ($1, $2, ...)?

by ferreira (Chaplain)
on Feb 16, 2007 at 18:31 UTC ( [id://600499]=note )


in reply to How to get ($1, $2, ...)?

Update: the question was overhauled to emphasize the main topic, which is the node title, and because my first posting caused much more confusion than it should have. As the question already had many replies and votes, I posted the rewrite here as a reply, following the advice of other monks.

I am looking for a solution for the following problem: given an arbitrary regex (like qr/Title: (.*?), Author: (\w+) (\w+)$/) with an arbitrary number of groups (not known beforehand), how do I get ($1, $2, ...) in a generic way?

I envisaged a solution using @- and @+ and wrote the following piece of code. (See perlvar.)

    # return ($1, $2, ...) matched against $s
    sub _groups {
        my $s = shift;
        my @groups;
        foreach my $i (1..$#-) {
            push @groups, substr($s, $-[$i], $+[$i] - $-[$i]);
        }
        return @groups
    }

Then I can write:

    if (/$re/mgc) {
        @groups = _groups($_);  # ($1, $2, ...)
    }
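One case the helper above does not cover is an optional group that takes no part in the match. The following is only a sketch of one way to handle it, not a definitive version: the name groups_from_offsets and the optional ( Jr\.)? group are assumptions made for illustration. It loops up to $#+ rather than $#- and returns undef for groups whose offsets are undefined.

```perl
use strict;
use warnings;

# Hypothetical variant of _groups (the name is an assumption).
# Call it right after a successful match against $s.
sub groups_from_offsets {
    my ($s) = @_;
    my @groups;
    # $#+ counts every group in the last successful match;
    # $#- stops at the last group that actually matched.
    for my $i (1 .. $#+) {
        push @groups,
            defined $-[$i]
            ? substr($s, $-[$i], $+[$i] - $-[$i])
            : undef;    # this group did not take part in the match
    }
    return @groups;
}

my $line = "Title: The Moor's Last Sigh, Author: Salman Rushdie";
my $re   = qr/Title: (.*?), Author: (\w+) (\w+)( Jr\.)?$/;

my @groups;
if ($line =~ /$re/gc) {
    @groups = groups_from_offsets($line);
}

# prints: 4 groups: The Moor's Last Sigh | Salman | Rushdie | (undef)
print scalar(@groups), " groups: ",
    join(' | ', map { defined $_ ? $_ : '(undef)' } @groups), "\n";
```

Whether trailing unmatched groups should show up as undef or be dropped depends on what the caller expects, so treat the choice of $#+ here as one option.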

The question is: is there a better way to do this?

Background

Why, for Heaven's sake, do I think I need to get these ($1, $2, ...)?

Read more if you care.

I am writing code to extract pieces from a larger text in a flexible way. This is to be accomplished with a data-driven approach based on a set of regexes.

For example, it must be capable of extracting the title, author and publisher from this snippet, in the right order.

    Title: The Moor's Last Sigh
    Author: Salman Rushdie
    Publisher: Foo
    Title: The God of Small Things
    Author: Arundhati Roy
    Publisher: Bar

(Note: the input text will not always be as tidy as this example; there may be gobs of stuff to be ignored or skipped between the pieces of information that matter, like tags, whitespace, etc.)

As a simplified application of this, I wrote code that looks like this:

    my $text = THE EXAMPLE TEXT ABOVE ...

    # /m goes on the qr// itself: a compiled regex keeps its own flags
    # when it is interpolated into another match
    my $re_title     = qr/Title: (.*?)$/m;
    my $re_author    = qr/Author: (\w+) (\w+)$/m;
    my $re_publisher = qr/Publisher: (.*?)$/m;

    my @answers;

    {
        my %book;
        if ($text =~ /$re_title/mgc) {
            $book{title} = $1;
        }
        if ($text =~ /$re_author/mgc) {
            $book{author} = [ $1, $2 ];
        }
        if ($text =~ /$re_publisher/mgc) {
            $book{publisher} = $1;
        }
        push @answers, \%book;
    }

    {
        my %book;
        if ($text =~ /$re_title/mgc) {
            $book{title} = $1;
        }
        ...

(Note: the real code is not meant to be a maintenance nightmare like the piece above. The regexes are detached here because, in the real thing, both they and some of the control flow will come from data structures. What will remain is the way the text is processed.)

The main point here is that the /gc modifiers are used to get the scanner-like behavior described in Regexp Quote-Like Operators. With them, after a match, the scan can resume from the point where the last regex left off. This also avoids building one complex regex, which would become even more complex once I depart from this simplified approach of matching regexes in sequence and start implementing things like loops, conditionals and alternations.
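To make the scanner-like behavior concrete, here is a minimal sketch (the sample text with a junk line and the @found array are assumptions for illustration). Each scalar-context match with /g starts at pos($text), and /c keeps that position when a match fails.

```perl
use strict;
use warnings;

my $text = <<'END';
Title: The Moor's Last Sigh
Author: Salman Rushdie
junk to be skipped
Title: The God of Small Things
Author: Arundhati Roy
END

my @found;
# Scalar context + /g: one match per evaluation, starting at pos($text).
# /c: a failed match leaves pos($text) alone instead of resetting it.
while ($text =~ /^Title: (.*?)$/mgc) {
    my $title = $1;    # save it, the next match overwrites $1
    if ($text =~ /^Author: (\w+) (\w+)$/mgc) {
        push @found, "$title by $1 $2";
    }
}

# prints:
#   The Moor's Last Sigh by Salman Rushdie
#   The God of Small Things by Arundhati Roy
print "$_\n" for @found;
```

The two patterns stay separate and simple; the shared position in $text is what ties them together.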

The problem is that, to get all captured groups, I cannot call $text =~ /$re/mgc in list context, because /g will then loop and consume more input than I would like. For example, with the text above and

    if (@groups = $text =~ /Title: (.*?)$/mgc) {
        $book{title} = $1;
    }

the array @groups will hold ( "The Moor's Last Sigh", "The God of Small Things" ) and pos($text) will be left right before Author: Arundhati Roy (and then Salman Rushdie would be lost :). So I have to call $text =~ /$re/mgc in scalar context to get the scanner-like behavior, and I found myself wanting a way to get all the groups for an arbitrary regex. That is the reason for this question.
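The difference between the two contexts can be seen side by side in a small sketch (the variable names @all, @one_at_a_time and $next_author are assumptions for illustration). The position is rewound explicitly between the two experiments so that they do not interfere.

```perl
use strict;
use warnings;

my $text = <<'END';
Title: The Moor's Last Sigh
Author: Salman Rushdie
Title: The God of Small Things
Author: Arundhati Roy
END

# List context: /g runs through the whole text and returns the
# groups of every match at once.
my @all = $text =~ /^Title: (.*?)$/mgc;

# Scalar context: one match per evaluation, resuming at pos($text).
pos($text) = 0;    # rewind the scanner for the second experiment
my @one_at_a_time;
if ($text =~ /^Title: (.*?)$/mgc) {
    push @one_at_a_time, $1;
}
my $next_author = $text =~ /^Author: (\w+) (\w+)$/mgc ? "$1 $2" : undef;

print "list context:   ", join(' | ', @all), "\n";
print "scalar context: ", join(' | ', @one_at_a_time), "\n";
print "next author:    $next_author\n";
```

In scalar context only the first title is consumed, so the author that follows it is still there to be picked up by the next regex.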

Note 1. Before this question was rephrased, educated_foo answered with a nice alternative (at Re: How to get ($1, $2, ...)?) for _groups, and almut proposed a two-step process (at Re: How to get ($1, $2, ...)?), also in line with the node's problem. I thank all the other mongers who replied, and eric256, who inspired me to rewrite this question.

Note 2. Yeah, there are modules like Text::Scraper and Text::Template to do things like that, but they are not quite the same. Sometimes one needs to try to reinvent some wheels, even if it is just to have confidence in the wheels someone else made.

Note 3. demerphq pointed out that there is no way to do that in current production perls, only in blead or with a little XS for earlier versions. The best thing he can think of without using XS is: my @array=eval '($'.join(',$',1..$#-).')'; Thanks.
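For reference, the eval one-liner from Note 3 can be wrapped in a small sub. This is only a sketch: the name groups_via_eval and the guard for a match without groups are assumptions added here, not part of demerphq's suggestion.

```perl
use strict;
use warnings;

# Hypothetical wrapper (the name is an assumption). It builds the string
# '($1,$2,...)' from the last matched group number and evaluates it.
# Call it right after a successful match.
sub groups_via_eval {
    my $last = $#-;
    return () if $last < 1;    # no groups to fetch, skip the eval
    my @groups = eval '($' . join(',$', 1 .. $last) . ')';
    return @groups;
}

my $text = "Title: The Moor's Last Sigh, Author: Salman Rushdie\n";
my $re   = qr/Title: (.*?), Author: (\w+) (\w+)$/m;

my @groups;
if ($text =~ /$re/gc) {
    @groups = groups_via_eval();
}

# prints: The Moor's Last Sigh | Salman | Rushdie
print join(' | ', @groups), "\n";
```

On Perl 5.26 and later, perlvar also documents an @{^CAPTURE} array that holds the same list; it post-dates this thread.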

Replies are listed 'Best First'.
Re: Rewrite for "How to get ($1, $2, ...)?
by eric256 (Parson) on Feb 16, 2007 at 19:07 UTC

    I don't know if this will help or not, but if you split the text line by line and start a new book every time you see a title, you get the same scanner-like behavior without using /mgc. If you might have multiple fields on one line, that could be hard to use, but maybe some combination of the two methods would let you find a book and then use the @fields = $str =~ /$re/mgs code on just one book section at a time. For all I know, you might be able to split on a boundary before "Title" and then have each book as a chunk to run your multiple regexes on, without fear of them leaking over into the next book. Good luck! ;)

        use strict;
        use warnings;
        use Data::Dumper;

        my $test =<<HERE;
        Title: The Moor's Last Sigh
        Author: Salman Rushdie
        Publisher: Foo
        asdf asdf asdf a d f d a sf as
        Title: The God of Small Things
        Author: Arundhati Roy
        Publisher: Bar
        HERE

        my @lines = split /\n/, $test;

        my $re_title     = qr/Title: (.*?)$/;
        my $re_author    = qr/Author: (\w+) (\w+)$/;
        my $re_publisher = qr/Publisher: (.*?)$/;

        my @answers;
        my $book;
        for my $line (@lines) {
            if ($line =~ /$re_title/) {
                # if this is a title line then the previous book is done being scanned,
                # so push the previous book onto answers and start a fresh $book
                push @answers, $book if $book;
                $book = {};
                $book->{title} = $1;
            }
            elsif ($line =~ /$re_author/) {
                $book->{author} = [ $1, $2 ];
            }
            elsif ($line =~ /$re_publisher/) {
                $book->{publisher} = $1;
            }
        }
        # push the final book
        push @answers, $book;

        print Dumper(\@answers);

    ___________
    Eric Hodges

      This is kind of a fun little challenge, so I played with it some more. Below is a mix of several of the solutions here. I know it probably won't help you, but since I've worked it out and tested it some, I figured I'd share:

          use strict;
          use warnings;
          use Data::Dumper;

          my $test =<<HERE;
          Title: The Moor's Last Sigh
          Author: Salman Rushdie
          Publisher: Foo
          Title: The God of Small Things
          Author: Arundhati Roy
          Publisher: Bar
          Title: The Moor's Last Sigh, Author: Salman Rushdie
          HERE

          my @books = split /(?=Title:)/, $test;

          # Each entry pairs a regex with a callback that stores its groups.
          # /m is given on each qr// because a compiled regex keeps its own
          # flags when interpolated into /$reg/mgc below.
          my @res = (
              [ qr/^Title: (.*?), Author: (\w+) (\w+), Publisher: (\w+), Year: (\w+)$/m,
                sub { my $b = shift; $b->{title} = $1; $b->{author} = [$2,$3];
                      $b->{publisher} = $4; $b->{year} = $5; } ],
              [ qr/^Title: (.*?), Author: (\w+) (\w+), Publisher: (\w+)$/m,
                sub { my $b = shift; $b->{title} = $1; $b->{author} = [$2,$3];
                      $b->{publisher} = $4 } ],
              [ qr/^Title: (.*?), Author: (\w+) (\w+)$/m,
                sub { my $b = shift; $b->{title} = $1; $b->{author} = [$2,$3] } ],
              [ qr/^Title: (.*?)$/m,        sub { my $b = shift; $b->{title} = $1; } ],
              [ qr/^Author: (\w+) (\w+)$/m, sub { my $b = shift; $b->{author} = [$1,$2]; } ],
              [ qr/^Publisher: (.*?)$/m,    sub { my $b = shift; $b->{publisher} = $1; } ],
          );

          my @answers;
          for my $book_src (@books) {
              my $book = {};
              for my $re (@res) {
                  my $reg = $re->[0];
                  if ($book_src =~ /$reg/mgc) {
                      $re->[1]->($book);
                  }
              }
              push @answers, $book;
          }

          print Dumper(\@answers);

          __END__
          $VAR1 = [
                    {
                      'author' => [ 'Salman', 'Rushdie' ],
                      'title' => 'The Moor\'s Last Sigh',
                      'publisher' => 'Foo'
                    },
                    {
                      'author' => [ 'Arundhati', 'Roy' ],
                      'title' => 'The God of Small Things',
                      'publisher' => 'Bar'
                    },
                    {
                      'author' => [ 'Salman', 'Rushdie' ],
                      'title' => 'The Moor\'s Last Sigh'
                    }
                  ];

      ___________
      Eric Hodges
Re: Rewrite for "How to get ($1, $2, ...)?
by throop (Chaplain) on Feb 17, 2007 at 23:43 UTC
    Even after reading and re-reading your Update, I still don't quite get your problem.
    1. Do your patterns span multiple lines, or must each pattern match within a single line?
    2. Do your patterns always start matching at the beginning of a line, or do you sometimes pick up a pattern halfway through a line?
    If the patterns can span multiple lines, then when you apply the first pattern, it may skip over text that would match the second (third, fourth...) pattern. Won't it?

    Also, unless the patterns are constrained to start matching at the beginning of a line, it sounds like you're heading for a wildly inefficient and slow routine. Or am I misunderstanding you?

    throop
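
The skipping concern raised above can be reproduced with a small sketch (the sample text with a missing Author line and the %loose and %anchored hashes are assumptions for illustration). An unanchored /gc match is free to slide forward over another record, while a pattern that starts with \G may only match at the current position.

```perl
use strict;
use warnings;

# The first record has no Author line.
my $text = <<'END';
Title: The Moor's Last Sigh
Title: The God of Small Things
Author: Arundhati Roy
END

# Unanchored: the Author regex slides past the second Title line.
my %loose;
if ($text =~ /^Title: (.*?)$/mgc) {
    $loose{title} = $1;
    if ($text =~ /^Author: (\w+) (\w+)$/mgc) {
        $loose{author} = "$1 $2";
    }
}

# Anchored with \G: only the very next line is allowed to match.
pos($text) = 0;    # rewind the scanner
my %anchored;
if ($text =~ /^Title: (.*?)$/mgc) {
    $anchored{title} = $1;
    if ($text =~ /\G\nAuthor: (\w+) (\w+)$/mgc) {
        $anchored{author} = "$1 $2";
    }
}

print "loose:    $loose{title} / ", $loose{author} // '(none)', "\n";
print "anchored: $anchored{title} / ", $anchored{author} // '(none)', "\n";
```

In the unanchored case the first title ends up paired with the second book's author; with \G the Author match fails and pos($text) stays where it was, thanks to /c.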
