how to extract string by possible groupings?

adrive has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how to extract string by possible groupings? by LanX (Saint) on Jun 02, 2014 at 14:40 UTC
I think you are confused about how groupings work:

    /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
    #01         2         3            4   5        6              7

Each opening bracket starts a grouping. Groupings that don't match will be `undef`!

You can use the extended regex `(?:PATTERN)` for clustering without grouping, which skips an index update, or even avoid `(...)` where you don't need any clustering at all (like in your or-branches).

Cheers Rolf (addicted to the Perl Programming Language)
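A minimal sketch of the difference between capturing and clustering, not taken from the thread. The sample line and both patterns are made up for illustration; only the behaviour of `(...)` versus `(?:...)` is the point.

```perl
use strict;
use warnings;

my $line = 'test2.c None';   # hypothetical sample input

# Every "(" opens a capturing group, including the one in the branch
# that does not take part in the match (it comes back as undef).
my @capturing = $line =~ /(\w+\.(c)|\w+\.(h))\s+(None)/;

# (?:...) clusters without capturing, so only the wanted fields are returned.
my @clustered = $line =~ /(\w+\.(?:c|h))\s+(None)/;

print scalar(@capturing), " elements with plain parens\n";
print defined($_) ? "  '$_'\n" : "  undef\n" for @capturing;
print scalar(@clustered), " elements with (?:...)\n";
print "  '$_'\n" for @clustered;
```

The first list has four elements, one of them `undef`, while the second holds just the two fields of interest.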
Re^2: how to extract string by possible groupings? by davido (Cardinal) on Jun 02, 2014 at 23:16 UTC
It's possible that a capture group was missed in your explanation:

    /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
    #01         2         3            4   5        6              7

If you lay that out using the /x modifier it becomes more obvious:

    /
      (                   # 1
        (.*\.c\s)         # 2
        |
        (.*\.h\s)         # 3
        |
        (.*\.cpp\s)       # 4
      )
      |
      (\s+                # 5
        (.*)\%\s+         # 6
        (of+)\s+\d+\s     # 7
      )
      |
      (\bNone\b)          # 8
    /gx

My preference would be to first reduce the capturing to just those parts that are needed. For example, it's unlikely that one would want both "1" and "2", "3", and "4". Likewise, it's unlikely that someone would care about "5" while also caring about "6" and "7". Second, resort to named captures: `(?<somename>...)`. And third, look at breaking it up into smaller problems with `/g` and `\G`.

I think, in particular, that named captures and `(?:...)` grouping where capturing isn't needed would make this easier to use.

Dave
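A minimal sketch of the named-capture suggestion, assuming Perl 5.10 or later. The group names (`title`, `pct2`, `pct3`), the `$field` pattern and the sample line are illustrative choices, not something the original poster used.

```perl
use strict;
use warnings;

my $line = 'test1.cpp   0.00% of 21   None';   # hypothetical sample input

# One pattern shared by both percentage columns.
my $field = qr/\d+\.\d+% of \d+|None/;

my %row;
if ($line =~ /^(?<title>\S+)\s+(?<pct2>$field)\s+(?<pct3>$field)\s*$/) {
    %row = %+;    # copy the named captures before another match resets them
}

print "$_ => $row{$_}\n" for sort keys %row;
```

With names in place the code no longer depends on counting parentheses, so adding or removing a group elsewhere in the pattern does not shift the fields.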
Re^3: how to extract string by possible groupings? by AnomalousMonk (Archbishop) on Jun 02, 2014 at 23:43 UTC
> ... named captures ...

I think I would opt for a different course. Elaborating (well, second-guessing, really) on the example below, once you have validated a line ~~, and given that the fields are completely mutually exclusive~~, the fields just pop out and go down as smoothly as oysters, with no capturing at all (update: no capturing to capture groups, that is).

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common;
     ;;
     my @lines = (
       'test1.cpp 0.00% of 21 0.00% of 16',
       'test2.c None 16.53% of 484',
       'test3.h 0.00% of 138 None',
       '/x/y/foo.c 0.00% of 1 None',
       );
     ;;
     my $title   = qr{ \w+ (?: [.] \w+)* }xms;
     my $percent = qr{ $RE{num}{real} % \s+ of \s+ \d+ }xms;
     my $none    = qr{ None }xms;
     ;;
     for my $line (@lines) {
       print qq{line '$line'};
       die qq{  BAD LINE: '$line'} unless $line =~
           m{ \A $title (?: \s+ (?: $percent | $none)){2} \s* \z }xms;
       my ($t, $p1, $p2) = $line =~ m{ \A $title | $percent | $none }xmsg;
       print qq{  title: '$t'  pcent1: '$p1'  pcent2: '$p2'};
       }
    "
    line 'test1.cpp 0.00% of 21 0.00% of 16'
      title: 'test1.cpp'  pcent1: '0.00% of 21'  pcent2: '0.00% of 16'
    line 'test2.c None 16.53% of 484'
      title: 'test2.c'  pcent1: 'None'  pcent2: '16.53% of 484'
    line 'test3.h 0.00% of 138 None'
      title: 'test3.h'  pcent1: '0.00% of 138'  pcent2: 'None'
    line '/x/y/foo.c 0.00% of 1 None'
      BAD LINE: '/x/y/foo.c 0.00% of 1 None' at -e line 1.

Updates:

Actually removed capturing groups from validation regex.

It turns out the fields are not "completely mutually exclusive" as I originally claimed, so I had to change the extraction regex from `m{ $title | $percent | $none }xmsg` to `m{ \A $title | $percent | $none }xmsg`. This somewhat vitiates the intended thrust of this post, but I think the main point stands. Oh, well...
Re^3: how to extract string by possible groupings? by LanX (Saint) on Jun 02, 2014 at 23:26 UTC
> It's possibly that a capture group was missed in your explanation:

No, I started counting with 0 and you with 1. See Re^3: how to extract string by possible groupings? for why I did what I did! :)

Cheers Rolf (addicted to the Perl Programming Language)
Re^4: how to extract string by possible groupings? by davido (Cardinal) on Jun 02, 2014 at 23:57 UTC
Re^2: how to extract string by possible groupings? by AnomalousMonk (Archbishop) on Jun 02, 2014 at 23:14 UTC
    /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
    #01         2         3            4   5        6              7

Capture group numbering begins at 1, not 0, so the capture group variables corresponding to the capturing groups in the example would be `$1 .. $8`. In the `@-` and `@+` arrays, the offsets of the entire match are held at index 0. Otherwise, `$0` holds the script name. See Variables related to regular expressions and perlvar in general.

Update:

> The [originally posted] question was for a match in list context ... which returns the matches as a list into an array.

Quite right; my mistake.
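A small sketch of the numbering described above, using a made-up string and pattern. It copies the offsets out of `@-` and `@+` into ordinary variables (the names `$start`, `$end` and `$whole` are only for illustration) before printing them.

```perl
use strict;
use warnings;

my $text = 'xx foo bar';   # hypothetical sample input

my ($start, $end, $whole, @groups);
if ($text =~ /(f(oo)) (bar)/) {
    # Capture variables start at $1, numbered by opening parenthesis.
    @groups = ($1, $2, $3);

    # Index 0 of @- and @+ holds the start and end offsets of the whole match.
    ($start, $end) = ($-[0], $+[0]);
    $whole = substr $text, $start, $end - $start;
}

print "captures: @groups\n";
print "whole match '$whole' spans offsets $start to $end\n";
```

In list context the same three captures would land in `$match[0]` through `$match[2]`, which is where the 0-based and 1-based counts in this subthread come from.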
Re^3: how to extract string by possible groupings? by LanX (Saint) on Jun 02, 2014 at 23:22 UTC
See OP. The question was for a match in list context, `(@match) = ( $_ =~ /.../g )`, which returns the matches as a list into an array, i.e. `$match[0]=$1` (see `perlop` ¹).

I didn't want to confuse with more details than necessary...

Cheers Rolf (addicted to the Perl Programming Language)

Update: well, actually the /g modifier isn't necessary and might produce too many matches...

    DB<110> @matches = ('abcd' =~ /(.)(.)/)
     => ("a", "b")
    DB<111> @matches = ('abcd' =~ /(.)(.)/g)
     => ("a", "b", "c", "d")
    DB<112> $matches[0]
     => "a"

¹) `perlop#Regexp-Quote-Like-Operators`, `* Matching in list context`: `If the "/g" option is not used, "m//" in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3...).`
Re^2: how to extract string by possible groupings? by muba (Priest) on Jun 02, 2014 at 18:48 UTC
I tried to come up with a similar illustration of how grouping works but gave up after 10 minutes of coming up with nothing comprehensible. I think you managed to do it quite elegantly, for which ++.
Re^3: how to extract string by possible groupings? by LanX (Saint) on Jun 02, 2014 at 19:57 UTC
Thanks, but we have answered this question so many times already that I even doubt this visualization was originally my idea! :)

Cheers Rolf (addicted to the Perl Programming Language)
Re: how to extract string by possible groupings? by muba (Priest) on Jun 02, 2014 at 15:42 UTC
There are six things obviously wrong with your regex:

1. `\s` matches a single whitespace character, but as far as I can tell from your sample input, there could be multiple spaces between the columns. `\s` should be written `\s+`.
2. You have included the `\s+` inside the parens, meaning that the white space separating the columns is part of the data you're trying to capture (in other words, `$match[0]` won't be `"test1.cpp"`, it will actually be `"test1.cpp "`, and likewise `$match[1]` will have trailing spaces).
3. A percent sign doesn't carry any special meaning inside regular expressions, and thus it doesn't need to be escaped.
4. You use the `/g` modifier even though you don't need it.
5. Your grouping and capturing is a little off, and way too complex.
6. A good practice is DRY, or Don't Repeat Yourself. A good way to adhere to the DRY principle is to generalize stuff as much as possible. You violate this principle, though.

Regarding grouping and capturing, remember that every pair of plain parens inside a regex creates a capturing group, and captured substrings are returned in order of appearance (added: as LanX++ beautifully illustrated). Consider the following snippet:

    $string = "foo bar";
    @match = $string =~ m/(f(oo)) (b(ar))/;
    print "$match[0]\n"; # prints "foo" (captured by /(f(oo))/)
    print "$match[1]\n"; # prints "oo"  (captured by /(oo)/)
    print "$match[2]\n"; # prints "bar" (captured by /(b(ar))/)
    print "$match[3]\n"; # prints "ar"  (captured by /(ar)/)

Likewise, you seem to think that your `@match` variable will contain three elements, but as a matter of fact it will contain 8 (eight!): one for every pair of parens in your regex, some of which only surround non-data such as the word "of" or just whitespace `\s+`. Don't believe me? Do me a favour and run this snippet (in which I only fixed the `\s` vs `\s+` issue):

    use Data::Dumper;
    while (my $line = <DATA>) {
        chomp $line;
        @match = $line =~ m/((.*\.c\s+)|(.*\.h\s+)|(.*\.cpp\s+))|(\s+(.*)\%\s+(of)\s+\d+\s)|(\bNone\b)/;
        print "$line\n";
        print Dumper \@match;
    }
    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None

The output I get:

    [... snip ...]
    test1.cpp 0.00% of 21 0.00% of 16
    $VAR1 = [
              'test1.cpp ',
              undef,
              undef,
              'test1.cpp ',
              undef,
              undef,
              undef,
              undef
    [... snip ...]

This neatly demonstrates at least three things:

1. You've captured the filename twice (once because of the outer group, once because of the extension-specific group for .cpp).
2. The matched file name includes the trailing white space, which I don't think is part of the filename anyway.
3. Your `@match` array contains way more elements than you think it does: nearly three times as many!

As for the DRY principle, you violate this for example in the chunk of the regex where you try to capture the file names. What you have written is: "match any number of characters, a literal period, a literal 'c', white space; OR match any number of characters, a literal period, a literal 'cpp', white space; OR match any (...)". I'm sure you get the pattern. The way I would have written it would read as: "match any number of characters, a literal period, one of these literal strings ('c', 'cpp', 'h'), whitespace."

    /(.*\.(?:c|cpp|h))\s+/ # Use (?:...) to create a non-capturing group.

The readability of your script could use some work too. Here's how I would've written it:

    # I always start my script with these two lines.
    # They prevent you from making various mistakes
    # and make debugging a whole lot easier.
    use strict;
    use warnings;

    # Regular expressions have the tendency to become long
    # strings of near-undecipherable line noise. To avoid
    # that, I usually like to split them up in smaller
    # logical chunks.
    # In this case, I'd write one regex to capture the
    # file names and one regex to capture percentages.
    my $title_re   = qr/.*\.(?:c|cpp|h)/;
    my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;

    # Next thing is to combine them into a single
    # regex to match the input against.
    # I use the /x modifier so that I can use
    # white space and comments inside the regex.
    my $line_re = qr/
        ($title_re)   \s+   # Match and capture file names, match whitespace
        ($percent_re) \s+   # Match and capture Percent2, match non-data
        ($percent_re)       # Match and capture Percent3
    /x;

    <DATA>; # Read and discard the first line, as this contains non-data.

    # Read input line by line, cut off newline
    # characters from the end.
    while (my $line = <DATA>) {
        chomp $line;

        # Match input against the regex, capture
        # the stuff into separate variables.
        # I mean, I find a "$title" much more
        # comprehensible than "$match[0]".
        my ($title, $percent2, $percent3) = $line =~ $line_re;

        print "$line\n";
        print "Title: $title\n";
        print "Percent2: $percent2\n";
        print "Percent3: $percent3\n";
        print "\n";
    }

    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None

Output:

    C:\Users\Lona\Desktop>perl x.pl
    test1.cpp 0.00% of 21 0.00% of 16
    Title: test1.cpp
    Percent2: 0.00% of 21
    Percent3: 0.00% of 16

    test2.c None 16.53% of 484
    Title: test2.c
    Percent2: None
    Percent3: 16.53% of 484

    test3.h 0.00% of 138 None
    Title: test3.h
    Percent2: 0.00% of 138
    Percent3: None
Re^2: how to extract string by possible groupings? by Laurent_R (Canon) on Jun 02, 2014 at 16:54 UTC
I wish I could upvote more than once such a useful, detailed and complete post.
Re^3: how to extract string by possible groupings? by muba (Priest) on Jun 02, 2014 at 18:45 UTC
As much as those warm words are appreciated, I do think I could've been even more complete by including links to relevant sections of the documentation, but I didn't feel like it ;)
Re^2: how to extract string by possible groupings? by adrive (Scribe) on Jun 03, 2014 at 02:24 UTC
thanks! this is really clear and easy to understand. although, what does the symbol ":?" mean? also..i didn't even know qr can prepare regex pattern.. I guess I'm too rusty in perl!!
Re^3: how to extract string by possible groupings? by LanX (Saint) on Jun 03, 2014 at 02:37 UTC
> what does the symbol ":?" mean

It's `(?:...)`, not `:?`. See (as already mentioned) `perlre#Extended-Patterns`.

Cheers Rolf (addicted to the Perl Programming Language)
Re^3: how to extract string by possible groupings? by Laurent_R (Canon) on Jun 03, 2014 at 06:49 UTC
`(?:...)` is used for non-capturing parentheses. This is useful when you need to group a subpattern (for example for an alternation or a quantification), but are not interested in capturing the content in `$1`, `$2`, etc.
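To illustrate the quantification case, here is a small sketch with a made-up dotted name; the pattern and variable names are assumptions for the example only. It also shows a related pitfall: a capturing group under a quantifier keeps only its last repetition.

```perl
use strict;
use warnings;

my $path = 'a.b.c.txt';   # hypothetical sample input

# A quantified capturing group keeps only its last repetition.
my ($last_piece) = $path =~ /^(\w+\.)+txt$/;

# Cluster the repeated part with (?:...) and capture around the whole
# repetition to keep all of it.
my ($all_pieces) = $path =~ /^((?:\w+\.)+)txt$/;

print "capture inside the repetition: '$last_piece'\n";
print "capture around the repetition: '$all_pieces'\n";
```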
Re: how to extract string by possible groupings? by no_slogan (Deacon) on Jun 02, 2014 at 14:42 UTC
Can you maybe use something like: `@match = split /\s{2,}/, $_;` or `@match = split /\t/, $_;`
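A minimal sketch of the `split` approach, assuming (as a guess about the real input) that columns are separated by two or more spaces and that single spaces occur only inside a field. The sample rows are made up.

```perl
use strict;
use warnings;

# Hypothetical rows: two or more spaces between columns,
# single spaces only inside a field such as "0.00% of 21".
my @lines = (
    'test1.cpp     0.00% of 21     0.00% of 16',
    'test2.c       None            16.53% of 484',
);

my @rows;
for my $line (@lines) {
    my ($title, $percent2, $percent3) = split /\s{2,}/, $line;
    push @rows, [$title, $percent2, $percent3];
    print "title='$title' percent2='$percent2' percent3='$percent3'\n";
}
```

Note that a line starting with the separator would yield an empty first field, so leading whitespace may need trimming first.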
Re^2: how to extract string by possible groupings? by adrive (Scribe) on Jun 03, 2014 at 02:26 UTC
oh man..........this is the simplest and it is applicable to my case since the group separation is only if it is more than 1 space. thanks a bunch
Re: how to extract string by possible groupings? by AnomalousMonk (Archbishop) on Jun 02, 2014 at 16:31 UTC
The approach of factoring regex sub-expressions can also be helpful. `$RE{num}{real}` is from Regexp::Common. The `$title` regex won't properly match something like `'/foo/bar/test.c'`, so this regex (and the others) may need to be refined; this is easier to do if regexes have been factored into individual components.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common;
     ;;
     my @lines = (
       'test1.cpp 0.00% of 21 0.00% of 16',
       'test2.c None 16.53% of 484',
       'test3.h 0.00% of 138 None',
       '/x/y/foo.c 0.00% of 1 None',
       );
     ;;
     my $title   = qr{ \w+ (?: [.] \w+)* }xms;
     my $percent = qr{ $RE{num}{real} % \s+ of \s+ \d+ }xms;
     my $none    = qr{ None }xms;
     ;;
     for my $line (@lines) {
       print qq{line '$line'};
       die qq{bad line: '$line'} unless my ($t, $p1, $p2) = $line =~
           m{ \A ($title) \s+ ($percent | $none) \s+ ($percent | $none) \s* \z }xms;
       print qq{  title: '$t'  pcent1: '$p1'  pcent2: '$p2'};
       }
    "
    line 'test1.cpp 0.00% of 21 0.00% of 16'
      title: 'test1.cpp'  pcent1: '0.00% of 21'  pcent2: '0.00% of 16'
    line 'test2.c None 16.53% of 484'
      title: 'test2.c'  pcent1: 'None'  pcent2: '16.53% of 484'
    line 'test3.h 0.00% of 138 None'
      title: 'test3.h'  pcent1: '0.00% of 138'  pcent2: 'None'
    line '/x/y/foo.c 0.00% of 1 None'
    bad line: '/x/y/foo.c 0.00% of 1 None' at -e line 1.

Update: Changed code example to better demonstrate error handling.
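One possible refinement of the title pattern, sketched under assumptions: path components are taken to be word characters and dots separated by `/`, and a plain decimal pattern stands in for `$RE{num}{real}` so the example runs without Regexp::Common. The names `$name` and `$field` are hypothetical helpers, not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical refinement: allow optional leading directory components.
my $name  = qr{ \w+ (?: [.] \w+ )* }xms;
my $title = qr{ (?: / )? (?: $name / )* $name }xms;

# Simplified stand-in for the percent/None field.
my $field = qr{ \d+ [.] \d+ % \s+ of \s+ \d+ | None }xms;

my @lines = (
    'test3.h 0.00% of 138 None',
    '/x/y/foo.c 0.00% of 1 None',
);

my @parsed;
for my $line (@lines) {
    my ($t, $p1, $p2) =
        $line =~ m{ \A ($title) \s+ ($field) \s+ ($field) \s* \z }xms
        or die "bad line: '$line'";
    push @parsed, [$t, $p1, $p2];
    print "title: '$t'  pcent1: '$p1'  pcent2: '$p2'\n";
}
```

Because the pieces are factored, only `$title` had to change; the line-level pattern stays the same.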
Re: how to extract string by possible groupings? by BillKSmith (Monsignor) on Jun 03, 2014 at 04:35 UTC
I prefer to match each field separately.

    #!perl
    use strict;
    use warnings;
    *FILE_EXPECTED_RESULT = *DATA;
    while (<FILE_EXPECTED_RESULT>) {
        next if /^\s*$/;
        chomp;
        print "\n", $_ , "\n";
        my (@match) = /
            ( \w* \. (?: c|cpp|h ) )                  # File Name
            \s*
            ( None | \d{1,2}\.\d\d%\sof\s\d{1,3} )    # Percent 2
            \s*
            ( None | \d{1,2}\.\d\d%\sof\s\d{1,3} )    # Percent 3
        /xms
            or next;    # skip lines (such as the header) that do not match
        print "title : " . $match[0] . "\n";
        print "percent2 : " . $match[1] . "\n";
        print "percent3 : " . $match[2] . "\n";
    }
    close(FILE_EXPECTED_RESULT);
    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None

Bill

