Re: Cropping the output of the pattern matcher

For parsing HTML you are best off avoiding a regex, because HTML is not easy to parse. For example:

<!-- 
   > A really funky image.
-->
<img src="light.gif" alt=">>LIGHT<<" />
<!-- was: <img src="light.jpg" alt="<light>" /> -->
This is some text.

Because > and < can appear in places other than as HTML tag delimiters (inside comments and quoted attribute values, as above), HTML parsing is probably best left to HTML::TokeParser or HTML::Parser. For your case you might also want to look at HTML::TableExtract.
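To make the problem concrete, here is a minimal sketch that runs a simple "text between > and <" pattern over the sample above. The variable names are only for illustration, and this is a demonstration of the failure, not a suggested way to parse HTML.

```perl
use strict;
use warnings;

# The sample HTML from above, in a non-interpolating heredoc.
my $html = <<'END_HTML';
<!--
   > A really funky image.
-->
<img src="light.gif" alt=">>LIGHT<<" />
<!-- was: <img src="light.jpg" alt="<light>" /> -->
This is some text.
END_HTML

# Collect everything the pattern captures between a ">" and a "<".
my @captures = $html =~ />(\s*[[:alnum:]]*)</g;

# Print each capture with newlines made visible.
for my $c (@captures) {
    (my $shown = $c) =~ s/\n/\\n/g;
    print "captured: '$shown'\n";
}
```

The captures include LIGHT, which lives inside an attribute value, while the only real text in the sample ("This is some text.") is never captured.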

If you want to use your pattern, you can capture text using parentheses, which will place the captured text into the $<digit> variables ($1, $2, ...), or return it as the result of the match in list context.
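A short sketch of both ways of getting at captured text. The input line is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this example.
my $line = '<td>  widget42</td>';

# Boolean context: on success the capture is available in $1.
my $from_var;
if ($line =~ />(\s*[[:alnum:]]*)</) {
    $from_var = $1;
}

# List context: the match returns the captured groups directly.
my ($from_list) = $line =~ />(\s*[[:alnum:]]*)</;

print "via \$1:   '$from_var'\n";
print "via list: '$from_list'\n";
```

Only rely on $1 after checking that the match succeeded; a failed match leaves the values from the previous successful match in place.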

Note that your regex parses very differently from how you think it does. Here is the output of -MO=Deparse on it, modified to use m// instead of // so regexes stand out:

m/>\s+\w*</ | m/>\w*</ | m/>\w*</s + m//

I doubt this is the way you think it parses.
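A rough sketch of the difference, assuming alternation was the intent: a | between separate matches is Perl's bitwise-or operator, while a | inside a single pattern is regex alternation. The input string and the patterns here are made up for illustration and are not your original code.

```perl
use strict;
use warnings;

# Hypothetical input, invented for this example.
my $text = '<b>  hello</b>';

# Two separate matches joined by Perl's bitwise-or operator:
# the result is just a number, not the matched text.
my $ored = ($text =~ m/>\s+\w*</) | ($text =~ m/>\w*</);

# One pattern with alternation inside it, plus a capture group.
my ($captured) = $text =~ m/>(\s+\w*|\w*)</;

print "bitwise or of two matches: $ored\n";
print "alternation in one regex:  '$captured'\n";
```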

However, besides the fact that it does not compile with those delimiters, your regex needs work to match the way you document it as matching. A straightforward translation of your specification would be:

  if (/>(\s*[[:alnum:]]*)</) {
    my $matched = $1;
    # ...
  } else {
    # didn't match
  }

(Note that \w does not match just alphanumerics (it also matches _), so I did not use it there. I also suspect you defined what you want to match incorrectly. Update: I also excluded the 0 or more spaces after the "<", because a pattern for zero or more spaces always succeeds, even when there are no spaces at all.)
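A quick way to check the \w versus [[:alnum:]] difference for yourself. The test characters are arbitrary.

```perl
use strict;
use warnings;

# Record, for a few single characters, whether each class matches.
my (%word, %alnum);
for my $ch ('a', '7', '_', '-') {
    $word{$ch}  = $ch =~ /^\w$/          ? 1 : 0;
    $alnum{$ch} = $ch =~ /^[[:alnum:]]$/ ? 1 : 0;
    printf "%-3s \\w=%d  [[:alnum:]]=%d\n", "'$ch'", $word{$ch}, $alnum{$ch};
}
```

The underscore is the one character in this list that \w accepts and [[:alnum:]] rejects.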

(update: minor rephrasing to make things make more sense.)
