
I too have often wished that capturing brackets inside a repeat group would capture to successively higher $n vars.

Actually, I wish that all the captures were made available via a magic array -- @^N seems a likely candidate given recent enhancements to the regex engine -- and that repeat group captures worked logically.

You seem to want to parse something like this with a regex:

a fixed bit: a,variable,length, repeated, bit [some more fixed stuff]
more fixed: more,variable,stuff [more fixed]

A repeat group allows you to match this easily enough, but capturing all of the individual bits at the same time isn't easy. Which is a pain.

I think the simplest (and probably most portable) way of doing this is to capture the variable bit to a single $n var on the first pass and break out the individual bits from there:

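while( my $data = <DATA> ) {
    # First pass: fixed prefix, the whole variable run, bracketed suffix.
    next unless $data =~ m[^
        ( [\w\s]+ ) :
        ( [^\x5b]+ )  \x5b
        ( [^\x5d]+ )  \x5d
    ]x;
    my ($first_bit, $variable, $last_bit) = ( $1, $2, $3 );
    # Second pass: break the variable run into its individual words.
    my @variable_bits = $variable =~ m[(\w+)[,\s]]g;
    print "$first_bit: (@variable_bits) [$last_bit]\n";
}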
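
For anyone who wants to run the two-pass idea as-is, here is a minimal self-contained sketch. The helper name parse_record and the use of m{...} delimiters are my own choices for illustration, not part of the snippet above; the sample records are the two lines quoted earlier.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): split one record of the
# form "fixed: a,variable,list [fixed]" into its three parts.
sub parse_record {
    my ($line) = @_;

    # Pass 1: fixed prefix, the whole variable run, and the bracketed suffix.
    # \x5b and \x5d stand in for the literal square brackets.
    my ($first_bit, $variable, $last_bit) = $line =~ m{
        ^ ( [\w\s]+ ) :
        ( [^\x5b]+ ) \x5b
        ( [^\x5d]+ ) \x5d
    }x or return;

    # Pass 2: pull the individual words out of the variable run.
    my @variable_bits = $variable =~ m{(\w+)}g;

    return ($first_bit, \@variable_bits, $last_bit);
}

my @records = (
    'a fixed bit: a,variable,length, repeated, bit [some more fixed stuff]',
    'more fixed: more,variable,stuff [more fixed]',
);

for my $record (@records) {
    my ($first, $bits, $last) = parse_record($record) or next;
    print "$first: (@$bits) [$last]\n";
}
```

Copying the second capture into a lexical before running the second match means the inner regex never sees a capture variable that it is about to overwrite.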

That said, if you are using Perl 5.8 or later (the code below relies on $^N, which first appeared in 5.8.0), then there is another way of doing this:

#! perl -slw
use strict;
use re 'eval';

our ($num, $firstwords, $bracketed, $label, @bits, $pre_bit, $in_bit, $post_bit);

my $re = qr[
    (?{
        our($num, $firstwords, $bracketed, $label, $pre_bit, $in_bit, $post_bit, @bits)
            = ( (undef) x 7, () );
    })
    (\d+) :                             (?{ our $num        = $^N })
    ([^\x5b]+?) \x5b                    (?{ our $firstwords = $^N })
    ([^\x5d]+?) \x5d                    (?{ our $bracketed  = $^N })
    ([^:]+) : \s*                       (?{ our $label      = $^N })
    (?x-ism: ( [^,\s]+? ) [,\s]         (?{ push our @bits,   $^N }) )+?
    \s* \x5b
        (\w+) \(                        (?{ our $pre_bit    = $^N })
        ([\w ]+) \)                     (?{ our $in_bit     = $^N })
        (\w+)                           (?{ our $post_bit   = $^N })
    \x5d
]x;

while( <DATA> ) {
    print "$num : $firstwords [ $bracketed ] $label : [@bits] [ $pre_bit ( $in_bit ) $post_bit ]"
        if $_ =~ $re;
}

__DATA__
1: or more [semi-fixed] fields: and,some,variable,length,stuff [more(fixed)stuff]
2: kind of [similarly] formated: records,with,variable,differences [embedded(in the)records]

I know this feature is still labelled 'experimental', but I'd be surprised if it goes away. It seems really useful to me, though I doubt it has made it into many of the Perl regex clones yet.

Whether this is worth the effort to avoid the second regex is doubtful for the simple instances shown, but on more complicated records, this ability to capture disparate and variable parts directly into named (even if global) vars has distinct advantages.
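
Stripped down to just the repeat-group part, the technique might look like the sketch below. It assumes a reasonably recent Perl, where the code blocks in a literal qr are compiled along with the surrounding code; the function name split_record and the sample lines are mine, purely for illustration.

```perl
use strict;
use warnings;

# Package variable, as in the full example, so the embedded code can reach it.
our @bits;

my $re = qr{
    (?{ @bits = () })                # start every match attempt with an empty list
    ^ ( [^:]+ ) : \s*                # first group: the fixed label
    (?: ( [^,\s]+ ) [,\s]+           # one variable item plus its separator
        (?{ push @bits, $^N })       # save the group that has just closed
    )+?
    \x5b ( [^\x5d]+ ) \x5d           # last group: the bracketed tail
}x;

# Hypothetical wrapper: returns the label, a copy of the items, and the tail.
sub split_record {
    my ($line) = @_;
    my ($label, undef, $tail) = $line =~ $re or return;
    return ($label, [@bits], $tail);
}

for my $line (
    'a fixed bit: a,variable,length, repeated, bit [some more fixed stuff]',
    'more fixed: more,variable,stuff [more fixed]',
) {
    my ($label, $items, $tail) = split_record($line) or next;
    print "$label: (@$items) [$tail]\n";
}
```

One caveat: a push made inside a code block is not undone if the engine later backtracks past it. On these inputs each item is pushed only after its separator has matched and the engine never backs out of a completed iteration, but on patterns that can backtrack through the group the list would need localising or post-filtering.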

Note: My use of \x5b & \x5d isn't an affectation. There seems to be a bug in the regex engine (5.8 at least) that means that using m[ ( [^[]+ ) \[ ]x; or m[ ( [^]]+ ) \] ]x; (which I think ought to work) confuses it. This is true even if I escape the '[' and ']' within the character classes. Interestingly, it complains that the parens are unbalanced. I haven't pinned down the exact circumstances yet, but if anyone else has encountered a similar problem I'd be interested in hearing from them.
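
If the trouble comes from using square brackets as the pattern delimiters (an assumption on my part, I haven't confirmed it is the cause), one possible way around it is to pick a different delimiter so that brackets inside the pattern can simply be escaped. A small sketch with made-up sample text:

```perl
use strict;
use warnings;

my $text = 'alpha beta [gamma delta] epsilon';

# Curly-brace delimiters: the square brackets inside the pattern no longer
# interact with the delimiter, so ordinary backslash escapes are enough.
my ($before, $inside) = $text =~ m{ ^ ( [^\[]+ ) \[ ( [^\]]+ ) \] }x;

print "before: '$before'\n";
print "inside: '$inside'\n";
```

The \x5b and \x5d escapes remain the safer choice if the pattern has to keep its m[...] delimiters.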

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

In reply to Re: More Variable length regex issues by BrowserUk
in thread More Variable length regex issues by dextius
