PerlMonks  

More Variable length regex issues

by dextius (Monk)
on Jun 08, 2003 at 23:18 UTC (#264179, perlquestion)

dextius has asked for the wisdom of the Perl Monks concerning the following question:

Caveat: Please do not offer split as my solution, this is geared more as a regular expression issue...

Let's use CSV as an example..

    my $foo = "foo,bar,moo,cow";
    $foo =~ /(?:([^,]+),)+/;

The above will only set $1 to "moo". $2, $3 and $4 are all undefined, $' is "cow", and $& is "foo,bar,moo," (including the trailing comma).

I thought maybe the @- or @+ arrays might contain all of the matches, but to no avail...

What is strange is..

    my $mystr = "foo,bar,moo,cow";
    my @values = $mystr =~ m/(\w+)\,?/g;

@values will have all the matches I wanted.. But again, $1 will be set, and $2, $3, $4 will all be null.. Are capturing parentheses locked to $1 if matched multiple times? Is there any way around this? (without using split?)

Advance <- Thanks

Replies are listed 'Best First'.
Re: More Variable length regex issues
by pzbagel (Chaplain) on Jun 08, 2003 at 23:55 UTC

    Count the number of capturing parentheses in your regexes. There's one pair, hence only $1 gets set. Remember that /g simply applies the same regex over and over to the string.

    Now a question to you, why are you looking for $2-$4 being set when you should have all the values in the @values array? Why not just reference them there?
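To make the point above concrete, here is a minimal sketch. The sub names all_fields and last_repeat_capture are made up for illustration. It contrasts the repeated-group pattern from the question, where only the last iteration survives in $1, with the /g pattern in list context, which hands back every capture.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every capture of the single group,
# relying on m//g in list context.
sub all_fields {
    my ($str) = @_;
    my @values = $str =~ m/(\w+),?/g;
    return @values;
}

# Hypothetical helper: the repeated-group form from the question.
# Only the last iteration of the group is left in $1.
sub last_repeat_capture {
    my ($str) = @_;
    return $str =~ /(?:([^,]+),)+/ ? $1 : undef;
}

my @values = all_fields("foo,bar,moo,cow");
print "list context /g: @values\n";
print "repeated group:  ", last_repeat_capture("foo,bar,moo,cow"), "\n";
```

The number of $n variables is fixed by the number of capturing groups written in the pattern, not by how many times a group happens to match.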

    HTH

      Well, if you KNOW there will always be 4 fields, then hardcode that in your regex:

      m/^([^,]*),([^,]*),([^,]*),([^,]*)$/;

      A warning: that regex wouldn't work right if there were a comma embedded in quotes in one of the fields. I'd recommend using the Text::CSV_XS module, but as you state that you aren't actually using Perl...
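A small sketch of the quoted-comma problem, using only core Perl. The sub name four_fields and the sample lines are assumptions for illustration; the pattern is the four-field regex from above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the four-field regex shown above.
# In list context the match returns the four captures (or an empty list).
sub four_fields {
    my ($line) = @_;
    return $line =~ m/^([^,]*),([^,]*),([^,]*),([^,]*)$/;
}

# Four plain fields: split as expected.
my @plain = four_fields('foo,bar,moo,cow');

# Three logical fields, one of them quoted and containing a comma.
# The regex knows nothing about quotes, so it still reports four pieces
# and cuts the quoted field in half.
my @quoted = four_fields('foo,"bar,moo",cow');

print join(' | ', @plain),  "\n";
print join(' | ', @quoted), "\n";
```

A real CSV parser tracks quoting state, which is why a module is the safer choice for anything beyond trivially clean input.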

      Later

        Yeah, the whole "variable length" part of the string is the problem. Thanks for your help..
      Two reasons..

      1. I don't fully understand the match operator (even after reading Mastering Regex v2 twice). I'm trying to learn more about how those variables are populated in Perl, compared to Java, or Python...

      2. I'm not exactly using Perl right now. But since I couldn't solve this problem using a Perl regex, it seemed like a decent enough question..

        You don't mention why split is not an option. I am guessing it is because you aren't just trying to split a string on some delimiter: you are trying to learn regexes, and this is a problem you feel comfortable with. There is nothing wrong with that, but realize that what pzbagel said was your answer ... your values are stored in the array. What you are trying to do, matching some arbitrary number of items and populating $1 through $N inside the match operator, just doesn't make sense to me. I mean, that's what the g modifier is for ... match all occurrences, no matter how many you find.
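For completeness, a hedged sketch of the g modifier in scalar context, where each pass through the loop is one match. The sub name fields_with_offsets is hypothetical. It also shows why @- and @+ from the original question don't hold every match: like $1, they describe only the most recent one, so they have to be read inside the loop.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string with m//g in scalar context and
# record each capture together with its start and end offsets.
sub fields_with_offsets {
    my ($str) = @_;
    my @found;
    while ($str =~ m/(\w+),?/g) {
        # $-[1] and $+[1] are the offsets of group 1 for this match only.
        push @found, [ $1, $-[1], $+[1] ];
    }
    return @found;
}

for my $f (fields_with_offsets("foo,bar,moo,cow")) {
    printf "%-3s starts at %2d, ends at %2d\n", @$f;
}
```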

        You hint at Java and Python, but you don't specify what language you are really trying to solve this problem in. If I had to guess, I would say you are using PHP or some Java library that modeled itself on Perl's regexes. Can't help you with the Java stuff, but if it's PHP you are using, then try preg_match_all(). It is like preg_match with Perl's g match modifier, but its usage is a bit tricky:
        <?php
        $mystr = 'foo,bar,moo,cow';
        preg_match_all('/(\w+)\,?/', $mystr, $matches);
        ?>
        <ul>
        <?php foreach ($matches[1] as $match) { ?>
            <li><?=$match?></li>
        <?php } ?>
        </ul>
        If you are using Python, then you can use the exact same regex with Python's re.findall():
        #!/usr/bin/python
        from re import findall

        mystr = 'foo,bar,moo,cow'
        # raw string so the backslashes reach the regex engine untouched
        values = findall(r'(\w+)\,?', mystr)
        for val in values:
            print(val)
        Hope this helps, I feel kinda dirty now ... PHP and Python at a Perl site! ;)

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
      I guess I'm just looking for something equivalent to "what was matched", as a structure. But not having to catch the other side of the regex as it matches...
Re: More Variable length regex issues
by BrowserUk (Patriarch) on Jun 09, 2003 at 06:05 UTC

    I too have often wished that capturing brackets inside a repeat group would capture to successively higher $n vars.

    Actually, I wish that all the captures were made available via a magic array -- @^N seems a likely candidate given recent enhancements to the regex engine -- and that repeat group captures worked logically.

    What you seem to want to do is to parse something like this with a regex

    a fixed bit: a,variable,length, repeated, bit [some more fixed stuff] more fixed: more,variable,stuff [more fixed]

    A repeat group lets you match this easily enough, but capturing all of the individual bits at the same time isn't easy. Which is a pain.

    I think that probably the simplest (and probably most portable) way of doing this is to capture the variable bit to a single $n var on the first pass and break out the individual bits from there:

    while( my $data = <DATA> ) {
        $data =~ m[^ ( [\w\s]+ ) : ( [^\x5b]+ ) \x5b ( [^\x5d]+ ) \x5d ]x;
        my ($first_bit, $last_bit) = ( $1, $3 );
        my @variable_bits = $2 =~ m[(\w+)[,\s]]g;
        print "$first_bit: (@variable_bits) [$last_bit]";
    }

    That said, if you were using Perl 5.6(?) or later, then there is another way of doing this:

    #! perl -slw
    use strict;
    use re 'eval';

    our ($num, $firstwords, $bracketed, $label, @bits, $pre_bit, $in_bit, $post_bit);

    my $re = qr[
        (?{
            our($num, $firstwords, $bracketed, $label, $pre_bit, $in_bit, $post_bit, @bits)
                = ( (undef) x 7, () );
        })
        (\d+) :                     (?{ our $num = $^N })
        ([^\x5b]+?) \x5b            (?{ our $firstwords = $^N })
        ([^\x5d]+?) \x5d            (?{ our $bracketed = $^N })
        ([^:]+) : \s*               (?{ our $label = $^N })
        (?x-ism:
            ( [^,\s]+? ) [,\s]      (?{ push our @bits, $^N })
        )+?
        \s* \x5b
        (\w+) \(                    (?{ our $pre_bit = $^N })
        ([\w ]+) \)                 (?{ our $in_bit = $^N })
        (\w+)                       (?{ our $post_bit = $^N })
        \x5d
    ]x;

    while( <DATA> ) {
        print "$num : $firstwords [ $bracketed ] $label : [@bits] [ $pre_bit ( $in_bit ) $post_bit ]"
            if $_ =~ $re;
    }

    __DATA__
    1: or more [semi-fixed] fields: and,some,variable,length,stuff [more(fixed)stuff]
    2: kind of [similarly] formated: records,with,variable,differences [embedded(in the)records]

    I know this feature is still labelled 'experimental', but I'd be surprised if it goes away. It seems really useful to me, but I doubt it has made it into many of the perl regex clones yet?

    Whether this is worth the effort to avoid the second regex is doubtful for the simple instances shown, but on more complicated records, this ability to capture disparate and variable parts directly into named (even if global) vars has distinct advantages.
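Applied to the CSV string from the original question, the same idea can be sketched in a few lines. This is an illustration under assumptions, not a recommendation: the sub name capture_repeats and the package variable @fields are made up here, and the pattern is kept simple enough that a successful match does not backtrack across the code block.

```perl
use strict;
use warnings;

our @fields;    # package variable so the embedded code block can see it

# Hypothetical helper: collect every iteration of a repeated group by
# pushing $^N (the most recently closed group) from inside the regex.
sub capture_repeats {
    my ($str) = @_;
    local @fields = ();
    if ($str =~ /^(?:([^,]+)(?{ push @fields, $^N }),?)+$/) {
        return @fields;
    }
    return;    # no match: whatever was pushed along the way is discarded
}

print join(' ', capture_repeats("foo,bar,moo,cow")), "\n";
```

One caveat: pushes made inside (?{ }) are not undone when the engine backtracks, so the collected list should only be trusted after the whole match has succeeded, and only for patterns where backtracking cannot re-run the block on the successful path.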

    Note: My use of \x5b & \x5d isn't an affectation. There seems to be a bug in the regex engine (5.8 at least) that means that using m[ ( [^[]+ ) \[ ]x; or m[ ( [^]]+ ) \] ]x; (which I think ought to work) confuses the regex engine. This is true even if I escape the '[' and ']' within the character classes. Interestingly, it complains that the parens are unbalanced. I haven't tied down the exact circumstances yet, but if anyone else has encountered a similar problem I'd be interested in hearing from them.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      The reason for your bug (in your Note:) is because when Perl is FIRST parsing your code, and it tries to determine where your regex starts and ends, it only looks for balanced square brackets. At that stage, it's not actually parsing your regex, just looking for its start and end. Thus, the square brackets IN the regex that aren't backslashed throw the parser off.
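A minimal sketch of one way around this, assuming nothing beyond core Perl: pick delimiters that do not occur in the pattern. The sub name before_bracket and the sample string are made up for illustration. With braces as delimiters, the square brackets are only ever seen by the regex engine, never by the scan that looks for the end of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: capture everything before the first '['.
# Braces delimit the pattern, so the delimiter scan ignores the
# square brackets and the character class can name '[' directly.
sub before_bracket {
    my ($str) = @_;
    return $str =~ m{ ( [^\[]+ ) \[ }x ? $1 : undef;
}

print before_bracket("fields: and,some,stuff [more]"), "\n";
```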

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Thanks for that. So the answer is, don't use square brackets as delimiters if the regex contains (unbalanced) square brackets.

        It's a shame that there aren't a couple more sets of balanced brackets in the arsenal:) Preferably a pair that could be used solely for quote-like delimiting. Maybe now we have unicode, we could find a pairing that wouldn't get overloaded for 7 other things too? Some chance I think:)



