http://qs321.pair.com?node_id=845173

LaintalAy has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm in need of wisdom.

I'm having trouble using regexes to retrieve the matches from a group that is quantified as zero-or-more or one-or-more. I'm trying with a simple file as an example:

1 23 456 789 0123 456
2 24 456 789 0123 456
3 23 456 789 0123 456
4 23 456 789 0123 456
5 23 456 789 0123 456

And I was wondering if it's possible to parse and assign them to variables in just one step. The regex I was trying to use was something like

while (<$fd>) {
    my $regex = '^(?:(\d+)\s+)+(\d+)$';
    (my ($d1, $d2, $d3, $d4, $d5, $d6) = $_) =~ m/$regex/;
    print "Line $.\n";
    print "\t$d1\n";
}

It doesn't work as I'd expect. It matches, but only retrieves the last two elements because instead of getting an array of results for the (?: )+ regex part it stores only the last one.

I know a split would work without much hassle, but shouldn't it be possible to do this with just a regex? I've tried different things without success, and I haven't found any relevant example.

Thanks,

Re: Variable matching on a regex
by almut (Canon) on Jun 17, 2010 at 10:02 UTC

    Not an answer to your question, but

    (my ($d1, $d2, $d3, $d4, $d5, $d6) = $_) = m/$regex/;

    would more naturally be written as

    my ($d1, $d2, $d3, $d4, $d5, $d6) = /$regex/;

    which is the same as

    my ($d1, $d2, $d3, $d4, $d5, $d6) = $_ =~ /$regex/;

    (The temporary assignment of $_ to $d1 in your variant doesn't do any harm, but doesn't help much either.)
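
    A minimal sketch of the behaviour under discussion, using the OP's pattern on one sample line. The sub name last_two_fields is made up for illustration. In list context the match returns one value per capture group, and a quantified group keeps only its final iteration, so only two values come back however many fields the line has.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the OP's pattern and return its captures.
sub last_two_fields {
    my ($line) = @_;
    # Two capture groups => at most two return values in list context.
    # The first group is repeated, so it holds only its last iteration.
    my @captures = $line =~ /^(?:(\d+)\s+)+(\d+)$/;
    return @captures;
}

my @got = last_two_fields('1 23 456 789 0123 456');
print "captured: @got\n";    # prints "captured: 0123 456"
```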


    Update: as for what you're attempting to do, IMHO, it would be a perfectly sensible thing to want to have as an option  (I also would have had uses for it occasionally).

    However, AFAIK, there is no way to do it, except if you implement it yourself with (?{...}) code or some such (as shown further down) — which of course ruins any elegance the approach might have had otherwise.

      Thanks a lot, you're right.
Re: Variable matching on a regex
by cdarke (Prior) on Jun 17, 2010 at 10:55 UTC
    Seems to me you are complicating matters because you consider that spaces follow each field except the last.
    So use zero or more spaces instead:
    my ($d1, $d2, $d3, $d4, $d5, $d6) = $_ =~ m/(\d+)\s*/g;
    Or use word boundaries:
    my ($d1, $d2, $d3, $d4, $d5, $d6) = $_ =~ m/\b(\d+)\b/g;
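
    The global match above can also feed an array, which copes with lines that have a different number of fields. This is only a sketch: the sub name numeric_fields and the sample lines are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of digits found in the line.
sub numeric_fields {
    my ($line) = @_;
    my @fields = $line =~ /(\d+)/g;    # list context + /g => all matches
    return @fields;
}

for my $line ('1 23 456', '2 24 456 789 0123 456', '7') {
    my @fields = numeric_fields($line);
    printf "%d field(s): %s\n", scalar @fields, join(', ', @fields);
}
```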

      OK, that works fine, but you're missing my point. That input is just an example, not an actual problem, and I agree the regex I'm trying to use is overkill.

      My question can be summarized as: is it possible to capture a variable number of values from a "fixed" regex (without using the /g feature)? Maybe the answer is just "no", but I wanted to know.

      Cheers,

        You could adapt the code in this node, pushing captures onto an array rather than concatenating them onto a scalar string. It uses regular expression recursion so there are actually two patterns involved rather than one "fixed" regex but the actual match is done just the once without a g flag. Obviously, the global match already shown is a much simpler solution.

        I hope this is of interest.

        Cheers,

        JohnGG

        Why do you want to avoid using /g in the first place?
        How might you possibly define what to capture without specifying all the options or repeating with /g?
        If you provide a pseudocode example, the monks can then come up with the closest real way to do it.

        PS: Whenever you think about declaring $d1, $d2, $d3, what you really want is @d and a more descriptive name.
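
        One possible reason for wanting a single anchored pattern is that it also rejects malformed lines, which a bare /g match does not. That motive is an assumption on my part, not something the OP stated. Under that assumption, a two-step sketch keeps /g and still validates; strict_fields is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the fields, or an empty list if the line
# is not made up solely of whitespace-separated digit runs.
sub strict_fields {
    my ($line) = @_;
    # Step 1: check the shape of the whole line with an anchored pattern.
    return () unless $line =~ /^\s*\d+(?:\s+\d+)*\s*$/;
    # Step 2: pull out every field with a global match.
    my @fields = $line =~ /(\d+)/g;
    return @fields;
}

my @good = strict_fields("1 23 456\n");
my @bad  = strict_fields('1 23 x56');
print scalar(@good), " fields, ", scalar(@bad), " fields\n";    # prints "3 fields, 0 fields"
```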

        Is it possible to capture a non fixed number of variables from a "fixed" regex? (without using /g feature).

        Sort of:

        @m=();
        'abcdefghijklmnopqrstuvwxyz' =~ m[(?:(?=(..)(?{ push @m, $^N })).)+];
        print for @m;;

        ab bc cd de ef fg gh hi ij jk kl lm mn no op pq qr rs st tu uv vw wx xy yz

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Maybe the answer is just "no"
        The answer is indeed "no".
Re: Variable matching on a regex
by eric256 (Parson) on Jun 17, 2010 at 14:23 UTC

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $data = '123 456 789 987 654 321';
    my @results;
    $data =~ /(\d+\s*(?{push @results, $1 if defined $1}))+$/;
    push @results, $1;
    print Dumper(@results);

    Just messing around, seeing what could be done. I'm not really sure why the last match doesn't get included or why the first one is undef, but I think the two are probably related. Perhaps I'm actually pushing the previous match rather than the current one. In fact, that's almost certainly what's happening. Does anyone know how to reference the current match in a code block?


    ___________
    Eric Hodges
      not really sure why the last match doesn't get included or why the first one is undef

      I think the capture group needs to be closed before being able to push its value.  This would work without further ado:

      $data =~ /^(?:(\d+)\s*(?{ push @results, $^N }))+$/;


      (Update)  P.S.: if you use this construct in a loop like in the OP's case, you need to declare the lexical @results outside of the loop for it to work properly.  I.e., while this is ok:

      my @results;
      while (<DATA>) {
          @results = ();
          /^(?:(\d+)\s*(?{ push @results, $^N }))+$/;
          print "line $.: ", join('-', @results), "\n";
      }
      __DATA__
      1 2 3 4 5
      2 3 4 5 6
      3 4 5 6 7

      output:

      line 1: 1-2-3-4-5
      line 2: 2-3-4-5-6
      line 3: 3-4-5-6-7

      the following would work only once:

      while (<DATA>) {
          my @results;
          /^(?:(\d+)\s*(?{ push @results, $^N }))+$/;
          print "line $.: ", join('-', @results), "\n";
      }

      output:

      line 1: 1-2-3-4-5
      line 2:
      line 3:


      (Fixed /^(:?... typo — thanks eric256! )
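
      The $^N construct can be wrapped in a sub, roughly as sketched here. The names capture_numbers and @collected are invented for the example. A package variable is used instead of a lexical so that the scoping problem described above does not arise. As far as I know, that problem was addressed when (?{...}) handling was reworked in Perl 5.18, but the package variable avoids relying on that.

```perl
use strict;
use warnings;

# Package variable rather than a lexical, so the code block does not
# depend on how a given perl version captures lexicals in (?{ ... }).
our @collected;

# Hypothetical helper: return every number on the line, or an empty
# list when the line as a whole does not match.
sub capture_numbers {
    my ($line) = @_;
    local @collected = ();
    # $^N is the text of the most recently closed group, i.e. the (\d+)
    # that has just finished matching.
    my $ok = $line =~ /^(?:(\d+)\s*(?{ push @collected, $^N }))+$/;
    my @copy = @collected;
    # Pushes made on paths that later fail are not undone, so the
    # result is only trusted when the overall match succeeded.
    return $ok ? @copy : ();
}

for my $line ("1 2 3 4 5\n", "2 3 4 5 6\n", "3 4 5 6 7\n") {
    print join('-', capture_numbers($line)), "\n";
}
```

      Gating the return value on the match result matters because a failed match can still have pushed values while the engine was backtracking.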

        Looks like minor typo at the start of the regex, but it works!

        $data =~ /(?:(\d+)\s*(?{push @results, $^N}))+$/;

        ___________
        Eric Hodges
Re: Variable matching on a regex
by furry_marmot (Pilgrim) on Jun 17, 2010 at 20:39 UTC

    If you're just trying to capture the numbers, then why not just do that?

    $s = '1 23 456 789 01 23 456';
    my ($d1, $d2, $d3, $d4, $d5, $d6, $d7) = $s =~ m/(\d+)/g;
    print "$d1, $d2, $d3, $d4, $d5, $d6, $d7\n";
    # Prints: 1, 23, 456, 789, 01, 23, 456
    or how about:
    @results = $s =~ m/(\d+)/g;
    $i = 1;
    print("\$d", $i++, ": $_\n") for @results;
    # Prints:
    # $d1: 1
    # $d2: 23
    # $d3: 456
    # $d4: 789
    # $d5: 01
    # $d6: 23
    # $d7: 456

    Some comments:

    (?:) is used to group without retaining the value. So whatever you match there won't be remembered.

    You grouped $_ with the my variables, which doesn't do any good.

    Read up on how regexes work. A regex ALWAYS starts trying to match at the beginning of the string and searches forward until it finds a match. If you anchor the match with ^, such as m/(^\d+)/, then you are saying to match only at the beginning of the string (or of each line, with /m). That is faster for something like searching for /^Subject:/m in a bunch of emails, because the attempt fails quickly on every line that doesn't start with 'S' and moves on to the next line. But it won't match "Subject:" anywhere else in the text. That's good in that example, but bad for the matches you're doing.

    The + and * quantifiers are greedy, so if you try to match /(\d+)/, Perl will search forward to the first (or next, if you're using /g) digit and keep matching until there are no more digits.

    You're trying to match \s+, but you aren't keeping it, and you don't really need to anchor on it, so there's no point in capturing it.
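
    The points about anchoring and greediness can be checked with a small sketch like the following. The sample mail text and the variable names are invented for illustration.

```perl
use strict;
use warnings;

my $mail = "From: someone\nSubject: hello\nBody mentions Subject: too\n";

# With /m, ^ matches at the start of every line, so only the real
# header line qualifies.
my @anchored = $mail =~ /^Subject:\s*(.*)$/mg;

# Unanchored, the pattern also finds the occurrence inside the body line.
my @anywhere = $mail =~ /Subject:/g;

# Greedy \d+ takes the whole run of digits; the lazy \d+? stops after one.
my ($greedy) = 'abc 12345 def' =~ /(\d+)/;
my ($lazy)   = 'abc 12345 def' =~ /(\d+?)/;

print scalar(@anchored), " anchored, ", scalar(@anywhere), " unanchored\n";
print "greedy=$greedy lazy=$lazy\n";
```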

    You can also match more complicated patterns and capture the results. Here I'm capturing groups of one or two digits that precede a group of 3 digits. I'm just using your data example, but it could be anything.

    $s = '1 23 456 789 01 23 456';
    push @results, $1 while $s =~ /((?:\b\d{1,2} )+\b\d{3,})/g;
    print "Match: $_\n" for @results;
    # Prints:
    # Match: 1 23 456
    # Match: 01 23 456

    Notice the use of (?:) within a capturing group, so that it won't be separately captured as $2.

    --marmot