PerlMonks
Rewrite for "How to get ($1, $2, ...)?"
by ferreira (Chaplain) on Feb 16, 2007 at 18:31 UTC ( [id://600499]=note )
Update: the question was overhauled to emphasize the main topic, which is the node title, because my first posting caused much more confusion than it should have. As the question already had many replies and votes, I posted it here as a reply, following the advice of other monks.

I am looking for a solution to the following problem: given an arbitrary regex (like qr/Title: (.*?), Author: (\w+) (\w+)$/) with an arbitrary number of groups (not known beforehand), how do I get ($1, $2, ...) in a generic way? I envisaged a solution using @- and @+ and wrote the following piece of code. (See perlvar.)
Then I can write:
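A minimal sketch of such a helper and its call (a reconstruction from the description above, not necessarily the original code), assuming the helper is named _groups as in Note 1 and takes the matched string as its only argument. The sample line is illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild ($1, $2, ...) for the last successful
# match against $str from the offsets stored in @- and @+.
sub _groups {
    my ($str) = @_;
    my $last = $#+;    # number of groups in the last successful match
    return map {
        # A group that did not take part in the match has undef offsets.
        defined $-[$_] ? substr($str, $-[$_], $+[$_] - $-[$_]) : undef
    } 1 .. $last;
}

my $re   = qr/Title: (.*?), Author: (\w+) (\w+)$/;
my $line = 'Title: The God of Small Things, Author: Arundhati Roy';

my @groups;
if ($line =~ $re) {
    @groups = _groups($line);
}
print join(' | ', @groups), "\n";
# prints: The God of Small Things | Arundhati | Roy
```

This relies on the match variables being dynamically scoped, so the sub still sees the caller's last match as long as it runs no regex of its own before reading @- and @+.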
The question is: is there a better way to do this?

Background

Why, for Heaven's sake, do I think I need to get these ($1, $2, ...)? Read on if you care.

I am writing code to extract pieces from a larger text in a flexible way. This is to be accomplished by a data-driven approach, based on a set of regexes. For example, it must be capable of extracting the title, author and publisher from this snippet, in the right order.
(Note. The input text is not supposed to be as nice as this example all the time: there may be gobs of stuff to be ignored/skipped in between the information that matters, like tags, whitespace, etc.)

As a simplified application of this, I wrote code that looks like:
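A rough sketch of a scanner along these lines, under stated assumptions: the sample text, the @steps table and the field names are all hypothetical, and the publisher field is left out. Each regex is applied once in scalar context with /gc, and the captures are read from @- and @+.

```perl
use strict;
use warnings;

# Illustrative input only, with some clutter between the fields.
my $text = <<'END';
<p>Title: The Moor's Last Sigh, Author: Salman Rushdie</p>
   (stuff to skip)
<p>Title: The God of Small Things, Author: Arundhati Roy</p>
END

# Data-driven part: an ordered list of [ field names, regex ] pairs.
my @steps = (
    [ [qw(title)],      qr/Title: (.*?),/       ],
    [ [qw(first last)], qr/Author: (\w+) (\w+)/ ],
);

my @records;
SCAN: while (1) {
    my %record;
    for my $step (@steps) {
        my ($names, $re) = @$step;
        # Scalar context + /gc: match once, resuming at pos($text);
        # a failed match leaves pos($text) where it was.
        last SCAN unless $text =~ /$re/mgc;
        # Captures of this match, rebuilt from the offset arrays.
        my @groups = map {
            defined $-[$_] ? substr($text, $-[$_], $+[$_] - $-[$_]) : undef
        } 1 .. $#+;
        @record{@$names} = @groups;
    }
    push @records, \%record;
}

printf "%s by %s %s\n", @$_{qw(title first last)} for @records;
```

The loop stops at the first regex that fails, so a record that is only partly matched at the end of the text is dropped rather than stored.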
(Note. The code is not meant to be a maintenance nightmare like the piece above. This piece looks weird with its detached regexes because it will eventually be abstracted, with those regexes and some control flow coming from data structures. What will remain is how the text is processed.)

The main point here is that the /gc modifiers are used to get the scanner behavior mentioned in Regexp Quote-Like Operators. With them, after a match, it is possible to resume the scan from the point where the last regex left off. This also avoids building one complex regex, which would become even more complex when I depart from this simplified approach of matching regexes in sequence and start implementing things like loops, conditionals and alternations.

The problem is that to get all captured groups, I cannot call $text =~ /$re/mgc in list context, or /g will create a loop and consume more input than I would like. For example, with the example above and
The array @groups will hold ( 'The Moor's Last Sigh', 'The God of Small Things' ) and leave pos($text) right before Author: Arundhati Roy (and then Salman Rushdie would be lost :). So I have to call $text =~ /$re/mgc in scalar context to get the scanner-like behavior, and I found myself wanting a way to get all the groups for an arbitrary regex. That is the reason for this question.

Note 1. Before the rephrasing of this question, educated_foo answered with a nice alternative (at Re: How to get ($1, $2, ...)?) for _groups and almut proposed a two-step process (at Re: How to get ($1, $2, ...)?), also in line with the node problem. I thank all the other mongers who replied, and eric256, who inspired me to rewrite this question.

Note 2. Yeah, there are modules like Text::Scraper and Text::Template to do things like that, but they are not quite the same. Sometimes one needs to try to reinvent some wheels, even if it is just to have confidence in the wheels someone else made.

Note 3. demerphq pointed out that there is no way to do that in current production perls, only in blead or with a little XS for earlier versions. The best thing he could think of without using XS is: my @array=eval '($'.join(',$',1..$#-).')';

Thanks.
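For illustration, the list-context pitfall and the string-eval workaround from Note 3 can be tried side by side. This is only a sketch: the sample text and the variable $next_author are made up for the demonstration, and the regexes are simplified.

```perl
use strict;
use warnings;

# Illustrative sample text (two records, nothing in between).
my $text = "Title: The Moor's Last Sigh, Author: Salman Rushdie\n"
         . "Title: The God of Small Things, Author: Arundhati Roy\n";

# List context: /g iterates over every match, so both titles are consumed.
my @groups = $text =~ /Title: (.*?),/mgc;

# The scan resumes after the second title, so the first author is skipped.
my $next_author = $text =~ /Author: (\w+) (\w+)/mgc ? "$1 $2" : undef;

# Scalar context: one match per call; captures fetched via string eval.
pos($text) = 0;
my @array;
if ($text =~ /Author: (\w+) (\w+)/mgc) {
    # Builds and evaluates the string '($1,$2)'; needs at least one group.
    @array = eval '($' . join(',$', 1 .. $#-) . ')';
}

print "list context:   @groups\n";
print "next author:    $next_author\n";
print "scalar context: @array\n";
```

Note that $#- is the index of the last group that actually matched, so trailing groups that did not take part in the match are not returned by this one-liner, and a regex with no groups at all would make the eval'ed string invalid.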