
There are six things obviously wrong with your regex:

  1. \s matches a single whitespace character, but as far as I can tell from your sample input, there could be multiple spaces between the columns. \s should be written \s+.
  2. You have included the \s+ inside the parens, meaning that the whitespace separating the columns is part of the data you're trying to capture (in other words, $match[0] won't be "test1.cpp", it will actually be "test1.cpp     ", and likewise $match[1] will have trailing spaces).
  3. A percent sign doesn't carry any special meaning inside regular expressions, and thus it doesn't need to be escaped.
  4. You use the /g modifier even though you don't need it.
  5. Your grouping and capturing are a little off, and way too complex.
  6. A good practice is DRY, or Don't Repeat Yourself. A good way to adhere to the DRY principle is to generalize as much as possible. Your regex violates this principle.
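Points 2 and 3 are easy to check for yourself. The following is a minimal sketch, assuming a sample line modelled on the data in this thread (the exact spacing is my assumption); it compares \s+ inside and outside the parens, and matches a percent sign without escaping it.

```perl
use strict;
use warnings;

# Hypothetical sample line, modelled on the data in this thread.
my $line = "test1.cpp     0.00% of 21     0.00% of 16";

# \s+ inside the parens: the separator becomes part of the capture.
my ($with_ws) = $line =~ m/(.*\.cpp\s+)/;

# \s+ outside the parens: only the file name is captured.
my ($without_ws) = $line =~ m/(.*\.cpp)\s+/;

# An unescaped % matches a literal percent sign just fine.
my ($percent) = $line =~ m/(\d+\.\d+)%/;

print "[$with_ws]\n";     # prints "[test1.cpp     ]"
print "[$without_ws]\n";  # prints "[test1.cpp]"
print "[$percent]\n";     # prints "[0.00]"
```

The square brackets in the output are only there to make the trailing whitespace visible.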

Regarding grouping and capturing, remember that every pair of parens inside a regex creates a capturing group, and captured substrings are returned in order of appearance (added: as LanX++ beautifully illustrated). Consider the following snippet:

    $string = "foo bar";
    @match = $string =~ m/(f(oo)) (b(ar))/;
    print "$match[0]\n"; # prints "foo" (captured by /(f(oo))/)
    print "$match[1]\n"; # prints "oo"  (captured by /(oo)/)
    print "$match[2]\n"; # prints "bar" (captured by /(b(ar))/)
    print "$match[3]\n"; # prints "ar"  (captured by /(ar)/)

Likewise, you seem to think that your @match variable will contain three elements, but as a matter of fact it will contain 8 (eight!): one for every pair of parens in your regex, some of which only surround non-data such as the word "of" or just whitespace \s+.

Don't believe me? Do me a favour and run this snippet (in which I only fixed the \s vs \s+ issue)

    use Data::Dumper;

    while (chomp(my $line = <DATA>)) {
        @match = $line =~ m/((.*\.c\s+)|(.*\.h\s+)|(.*\.cpp\s+))|(\s+(.*)\%\s+(of)\s+\d+\s)|(\bNone\b)/;
        print "$line\n";
        print Dumper \@match;
    }

    __DATA__
    Title         Percent2          Percent3
    test1.cpp     0.00% of 21       0.00% of 16
    test2.c       None              16.53% of 484
    test3.h       0.00% of 138      None

The output I get:

    [... snip ...]
    test1.cpp     0.00% of 21       0.00% of 16
    $VAR1 = [
              'test1.cpp     ',
              undef,
              undef,
              'test1.cpp     ',
              undef,
              undef,
              undef,
              undef
    [... snip ...]

This neatly demonstrates at least three things:

  1. You've captured the filename twice (once because of the outer group, once because of the extension-specific group for .cpp).
  2. The matched file name includes the trailing white space, which I don't think is part of the filename anyway.
  3. Your @match array contains way more elements than you think it does - nearly three times as many!

As for the DRY principle, you violate it, for example, in the chunk of the regex where you try to capture the file names. What you have written is: "match any number of characters, a literal period, a literal 'c', whitespace; OR match any number of characters, a literal period, a literal 'cpp', whitespace; OR match any (...)" I'm sure you get the pattern.

The way I would have written it reads as: "match any number of characters, a literal period, one of these literal strings ('c', 'cpp', 'h'), whitespace."

    /(.*\.(?:c|cpp|h))\s+/   # Use (?:...) to create a non-capturing group.
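To see what the (?:...) buys you, here is a small sketch comparing a plain (capturing) alternation with the non-capturing one. The sample line is again an assumption modelled on the data in this thread.

```perl
use strict;
use warnings;

# Hypothetical sample line, modelled on the data in this thread.
my $line = "test1.cpp     0.00% of 21     0.00% of 16";

# Capturing alternation: the extension lands in a group of its own.
my @capturing = $line =~ m/(.*\.(c|cpp|h))\s+/;

# Non-capturing alternation: only the file name is returned.
my @non_capturing = $line =~ m/(.*\.(?:c|cpp|h))\s+/;

print scalar(@capturing), ": @capturing\n";          # prints "2: test1.cpp cpp"
print scalar(@non_capturing), ": @non_capturing\n";  # prints "1: test1.cpp"
```

Both versions match the same text; the only difference is how many elements end up in the returned list.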

The readability of your script could use some work too. Here's how I would've written it:

    # I always start my script with these two lines.
    # They prevent you from making various mistakes
    # and make debugging a whole lot easier.
    use strict;
    use warnings;

    # Regular expressions have the tendency to become long
    # strings of near-undecipherable line noise. To avoid
    # that, I usually like to split them up in smaller
    # logical chunks.
    # In this case, I'd write one regex to capture the
    # file names and one regex to capture percentages.
    my $title_re   = qr/.*\.(?:c|cpp|h)/;
    my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;

    # Next thing is to combine them into a single
    # regex to match the input against.
    # I use the /x modifier so that I can use
    # white space and comments inside the regex.
    my $line_re = qr/
        ($title_re)   \s+   # Match and capture file names, match whitespace
        ($percent_re) \s+   # Match and capture Percent2, match non-data
        ($percent_re)       # Match and capture Percent3
    /x;

    <DATA>; # Read and discard the first line, as this contains non-data.

    # Read input line by line, cut off newline
    # characters from the end.
    while (my $line = <DATA>) {
        chomp $line;

        # Match input against the regex, capture
        # the stuff into separate variables.
        # I mean, I find a "$title" much more
        # comprehensible than "$match[0]".
        my ($title, $percent2, $percent3) = $line =~ $line_re;

        print "$line\n";
        print "Title: $title\n";
        print "Percent2: $percent2\n";
        print "Percent3: $percent3\n";
        print "\n";
    }

    __DATA__
    Title         Percent2          Percent3
    test1.cpp     0.00% of 21       0.00% of 16
    test2.c       None              16.53% of 484
    test3.h       0.00% of 138      None
Output:

    C:\Users\Lona\Desktop>perl x.pl
    test1.cpp     0.00% of 21       0.00% of 16
    Title: test1.cpp
    Percent2: 0.00% of 21
    Percent3: 0.00% of 16

    test2.c       None              16.53% of 484
    Title: test2.c
    Percent2: None
    Percent3: 16.53% of 484

    test3.h       0.00% of 138      None
    Title: test3.h
    Percent2: 0.00% of 138
    Percent3: None
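One thing the script above does not handle is a line that doesn't match at all: the three variables would be undef and the prints would trigger warnings. A possible guard, sketched here under the assumption that such lines should simply be set aside, is to test the list assignment itself. The @lines, @parsed and @skipped names and the second input line are made up for illustration.

```perl
use strict;
use warnings;

my $title_re   = qr/.*\.(?:c|cpp|h)/;
my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;
my $line_re    = qr/($title_re)\s+($percent_re)\s+($percent_re)/;

# Hypothetical input: the second line deliberately does not fit the format.
my @lines = (
    "test1.cpp     0.00% of 21     0.00% of 16",
    "this line is not a data row",
);

my (@parsed, @skipped);
for my $line (@lines) {
    # A list assignment used as a condition yields the number of
    # elements the match returned, so it is false when the match fails.
    if (my ($title, $percent2, $percent3) = $line =~ $line_re) {
        push @parsed, [$title, $percent2, $percent3];
    }
    else {
        push @skipped, $line;
    }
}

print "parsed:  ", scalar(@parsed), "\n";
print "skipped: ", scalar(@skipped), "\n";
```

Whether to skip, warn, or die on such lines depends on how much you trust your input.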

In reply to Re: how to extract string by possible groupings? by muba
in thread how to extract string by possible groupings? by adrive
