http://qs321.pair.com?node_id=1088268


in reply to how to extract string by possible groupings?

I think you are confused about how groupings work

/((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
#01         2         3            4   5        6              7

Each opening parenthesis starts a capture group. Groups that don't take part in the match will be undef!
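A minimal sketch of that point, assuming a made-up input string ('None') and a single match without /g. It only prints which slots of the returned list are defined:

```perl
use strict;
use warnings;

# The pattern from the post, applied once (no /g) in list context.
my $re = qr/((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/;

# Hypothetical input: only the last alternative can match here.
my @groups = 'None' =~ $re;

# One list element per opening parenthesis, matched or not.
for my $i (0 .. $#groups) {
    printf "index %d (\$%d): %s\n", $i, $i + 1,
        defined $groups[$i] ? "'$groups[$i]'" : 'undef';
}
```

Every slot except the last one comes back undef, so code that reads the list has to check for definedness.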

You can use the extended pattern (?:PATTERN) to cluster without capturing, so that the cluster doesn't use up an index.

update

... or even avoid (...) where you don't need any clustering at all (like in your or-branches).
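A rough sketch of both suggestions, assuming a made-up input string ('foo.h '). The first pattern swaps the outer capturing parentheses for (?:...), the second drops them altogether; both should leave six capture groups instead of eight:

```perl
use strict;
use warnings;

# Outer groups turned into non-capturing clusters.
my $clustered = qr/(?:(.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(?:\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/;

# Outer groups removed entirely; alternation does not need them here.
my $bare = qr/(.*\.c\s)|(.*\.h\s)|(.*\.cpp\s)|\s+(.*)\%\s+(of+)\s+\d+\s|(\bNone\b)/;

my $input = 'foo.h ';    # hypothetical input
my @from_clustered = $input =~ $clustered;
my @from_bare      = $input =~ $bare;

for my $result (\@from_clustered, \@from_bare) {
    print scalar(@$result), " groups: ",
        join(', ', map { defined $_ ? "'$_'" : 'undef' } @$result), "\n";
}
```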

Cheers Rolf

(addicted to the Perl Programming Language)

Re^2: how to extract string by possible groupings?
by davido (Cardinal) on Jun 02, 2014 at 23:16 UTC

    It's possible that a capture group was missed in your explanation:

    /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
    #01         2         3            4   5        6              7

    If you lay that out using the /x modifier it becomes more obvious:

    /
        (                       # 1
            (.*\.c\s)           # 2
            |
            (.*\.h\s)           # 3
            |
            (.*\.cpp\s)         # 4
        )
        |
        (\s+                    # 5
            (.*)\%\s+           # 6
            (of+)\s+\d+\s       # 7
        )
        |
        (\bNone\b)              # 8
    /gx

    My preference would be to first reduce the capturing to just those parts that are needed. For example, it's unlikely that one would want both "1" and any of "2", "3", and "4". Likewise, it's unlikely that someone would care about "5" while also caring about "6" and "7".

    Second, resort to named captures: (?<somename>...). And third, look at breaking it up into smaller problems with /g and \G.
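One way the /g and \G idea could look, as a sketch only. The input line, the field patterns, and the @fields variable are assumptions for illustration, not taken from the original question. Each small pattern picks up where the previous one stopped, and /c keeps the position when an attempt fails:

```perl
use strict;
use warnings;

my $line = 'test2.c None 16.53% of 484';    # hypothetical input line
my @fields;

# The file name has to come first.
$line =~ /\G(\S+\.(?:cpp|c|h))\s+/gc
    or die "no file name at start of '$line'";
push @fields, $1;

# Then take "None" or "N% of M" fields until neither pattern fits.
while (1) {
    if    ($line =~ /\G(None)\s*/gc)                      { push @fields, $1 }
    elsif ($line =~ /\G(\d+(?:\.\d+)?%\s+of\s+\d+)\s*/gc) { push @fields, $1 }
    else  { last }
}

# Anything left after the last successful match is a parse error.
die "unparsed text at offset " . pos($line) . " in '$line'"
    if pos($line) < length($line);

print join(' | ', @fields), "\n";
```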

    I think, in particular, that named captures and (?:...) grouping where capturing isn't needed would make this easier to use.
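A small sketch of how named captures plus (?:...) might look for this pattern. The group names (file, percent, total, none), the helper parse_field, and the sample strings are all hypothetical; named captures and %+ need Perl 5.10 or later:

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ arrived in Perl 5.10

# Same three alternatives, capturing only the interesting parts by name.
my $field_re = qr{
      (?<file> .* \. (?:cpp|c|h) ) \s
    | \s+ (?<percent> .* ) % \s+ of+ \s+ (?<total> \d+ ) \s
    | \b (?<none> None ) \b
}x;

# Hypothetical helper: returns a hash ref of the named groups that matched.
sub parse_field {
    my ($text) = @_;
    return unless $text =~ $field_re;
    my %named = %+;    # copy now; %+ changes with the next successful match
    return \%named;
}

for my $text ('test1.cpp ', ' 16.53% of 484 ', 'None') {
    my $named = parse_field($text) or next;
    my @pairs;
    push @pairs, "$_='$named->{$_}'" for sort keys %$named;
    print join(', ', @pairs), "\n";
}
```

Since %+ only lists groups that actually took part in the match, there is no counting of parentheses and no checking for undef.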


    Dave

      ... named captures ...

      I think I would opt for a different course. Elaborating (well, second-guessing, really) on the example below: once you have validated a line, and given that the fields are completely mutually exclusive, the fields just pop out and go down as smoothly as oysters, with no capturing at all (update: no capturing to capture groups, that is).

      c:\@Work\Perl\monks>perl -wMstrict -le
      "use Regexp::Common;
       ;;
       my @lines = (
         'test1.cpp 0.00% of 21 0.00% of 16',
         'test2.c None 16.53% of 484',
         'test3.h 0.00% of 138 None',
         '/x/y/foo.c 0.00% of 1 None',
         );
       ;;
       my $title = qr{ \w+ (?: [.] \w+)* }xms;
       my $percent = qr{ $RE{num}{real} % \s+ of \s+ \d+ }xms;
       my $none = qr{ None }xms;
       ;;
       for my $line (@lines) {
         print qq{line '$line'};
         die qq{ BAD LINE: '$line'} unless $line =~
             m{ \A $title (?: \s+ (?: $percent | $none)){2} \s* \z }xms;
         my ($t, $p1, $p2) = $line =~ m{ \A $title | $percent | $none }xmsg;
         print qq{ title: '$t' pcent1: '$p1' pcent2: '$p2'};
         }
      "
      line 'test1.cpp 0.00% of 21 0.00% of 16'
       title: 'test1.cpp' pcent1: '0.00% of 21' pcent2: '0.00% of 16'
      line 'test2.c None 16.53% of 484'
       title: 'test2.c' pcent1: 'None' pcent2: '16.53% of 484'
      line 'test3.h 0.00% of 138 None'
       title: 'test3.h' pcent1: '0.00% of 138' pcent2: 'None'
      line '/x/y/foo.c 0.00% of 1 None'
       BAD LINE: '/x/y/foo.c 0.00% of 1 None' at -e line 1.

      Updates:

      1. Actually removed capturing groups from validation regex.
      2. It turns out the fields are not "completely mutually exclusive" as I originally claimed, so I had to change the extraction regex from
            m{ $title | $percent | $none }xmsg
        to
            m{ \A $title | $percent | $none }xmsg
        This somewhat vitiates the intended thrust of this post, but I think the main point stands. Oh, well...
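A sketch of why the \A matters in update 2. To keep it free of dependencies it assumes a simplified stand-in for $RE{num}{real} from Regexp::Common, and it uses one of the sample lines above. Without the anchor, $title also matches things like '0.00' and 'of' before $percent gets a chance:

```perl
use strict;
use warnings;

my $title   = qr{ \w+ (?: [.] \w+)* }xms;
# Assumption: plain digits with an optional fraction stand in for $RE{num}{real}.
my $percent = qr{ \d+ (?: [.] \d+)? % \s+ of \s+ \d+ }xms;
my $none    = qr{ None }xms;

my $line = 'test1.cpp 0.00% of 21 0.00% of 16';

# Unanchored: $title is tried first at every position, so it eats the numbers.
my @loose = $line =~ m{ $title | $percent | $none }xmsg;

# Anchored: $title can only match at the very start of the line.
my @anchored = $line =~ m{ \A $title | $percent | $none }xmsg;

print "loose:    ", join(' | ', @loose),    "\n";
print "anchored: ", join(' | ', @anchored), "\n";
```

The \A binds only to the first alternative, which is exactly what is wanted here.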

      > It's possible that a capture group was missed in your explanation:

      No, I started counting at 0 and you at 1.

      See Re^3: how to extract string by possible groupings? for why I did what I did! :)

      Cheers Rolf

        Good call! :) As you probably guessed, I was considering \1, \2..., and their counterparts, $1, $2, etc.


        Dave

Re^2: how to extract string by possible groupings?
by AnomalousMonk (Archbishop) on Jun 02, 2014 at 23:14 UTC

    /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
    #01         2         3            4   5        6              7

    Capture group numbering begins at 1, not 0, so the capture variables corresponding to the capturing groups in the example would be $1 .. $8. In the @- and @+ arrays, index 0 holds the offsets of the entire match. $0, by contrast, holds the script name. See Variables related to regular expressions and perlvar in general.
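A short sketch of those variables side by side, assuming a made-up string and a cut-down two-group pattern:

```perl
use strict;
use warnings;

my $text = 'see None here';    # hypothetical input
$text =~ /(\bNone\b)|(\bSome\b)/ or die "no match";

# Index 0 of @- and @+ describes the whole match, not a capture group.
my ($start, $end) = ($-[0], $+[0]);
my $whole = substr $text, $start, $end - $start;
printf "whole match '%s' spans offsets %d..%d\n", $whole, $start, $end;

# Index 1 lines up with $1, index 2 with $2, and so on.
printf "\$1 = '%s' spans offsets %d..%d\n", $1, $-[1], $+[1];
printf "\$2 is %s\n", defined $2 ? "'$2'" : 'undef';

# $0 has nothing to do with the match.
printf "\$0 is the script name: %s\n", $0;
```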

    Update: The [originally posted] question was for a match in list context ... which returns the matches as a list into an array. Quite right; my mistake.

      See OP

      The question was for a match in list context

      (@match) = ( $_ =~ /.../g )

      which returns the matches as a list into an array.

      i.e. $match[0] = $1 (see perlop ¹)

      I didn't want to confuse with more details than necessary...

      Cheers Rolf

      update

      Well, actually the /g modifier isn't necessary and might produce too many matches...

      DB<110> @matches = ('abcd' =~ /(.)(.)/)
       => ("a", "b")
      DB<111> @matches = ('abcd' =~ /(.)(.)/g)
       => ("a", "b", "c", "d")
      DB<112> $matches[0]
       => "a"

      ¹) perlop#Regexp-Quote-Like-Operators

      * Matching in list context

      If the "/g" option is not used, "m//" in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3...).

Re^2: how to extract string by possible groupings?
by muba (Priest) on Jun 02, 2014 at 18:48 UTC

    I tried to come up with a similar illustration of how grouping works but gave up after 10 minutes of coming up with nothing comprehensible. I think you managed to do it quite elegantly, for which ++.

      Thanks, but we have answered this question so many times already that I even doubt this visualization was originally my idea! :)

      Cheers Rolf