PerlMonks  

Re: Cropping the output of the pattern matcher

by wog (Curate)
on Sep 23, 2001 at 23:44 UTC


in reply to Cropping the output of the pattern matcher

For parsing HTML you are best off avoiding a regex, because HTML is not easy to parse. For example:

<!-- > A really funky image. -->
<img src="light.gif" alt=">>LIGHT<<" />
<!-- was: <img src="light.jpg" alt="<light>" /> -->
This is some text.

Because > and < can appear in places other than delimiting HTML tags (inside comments and quoted attribute values, as above), HTML parsing is probably best left to HTML::TokeParser or HTML::Parser. For your case you might also want to look at HTML::TableExtract.
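
As a rough sketch of what that could look like, here is the sample above run through HTML::TokeParser (a CPAN module, so this assumes it is installed). The variable names are mine, and what you actually want to pull out of your document will differ; this only shows that the comments and the alt text do not confuse the parser:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The tricky sample from above, fed to the parser as a string reference.
my $html = <<'END_HTML';
<!-- > A really funky image. -->
<img src="light.gif" alt=">>LIGHT<<" />
<!-- was: <img src="light.jpg" alt="<light>" /> -->
This is some text.
END_HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my @img_alts;
my $text = '';
while (my $token = $parser->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' and $token->[1] eq 'img') {
        # Start tag token: [ 'S', $tag, \%attr, \@attrseq, $original_text ]
        push @img_alts, $token->[2]{alt};
    }
    elsif ($type eq 'T') {
        # Text token: [ 'T', $text, $is_data ]
        $text .= $token->[1];
    }
    # Comment tokens ('C') and anything else are simply skipped.
}

# Trim the whitespace left over between the tags and comments.
$text =~ s/^\s+|\s+$//g;

print "alt:  $_\n" for @img_alts;
print "text: $text\n";
```

The commented-out img tag never shows up as a start tag, and the > and < inside the alt attribute stay part of the attribute value.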

If you want to use your pattern, you can capture text using parentheses, which will place the captured text into the numbered variables ($1, $2, and so on), or return it as the result of the match in list context.
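
A small sketch of both ways of getting at the captured text. The input string is made up for illustration, and the pattern is the one suggested further down in this node:

```perl
use strict;
use warnings;

# Hypothetical input, just for illustration.
my $html = '<td>  hello</td>';

# Boolean (scalar) context: on success the capture is available in $1.
my $from_dollar_one;
if ($html =~ />(\s*[[:alnum:]]*)</) {
    $from_dollar_one = $1;
}

# List context: the match returns the list of captured groups.
my ($from_list) = $html =~ />(\s*[[:alnum:]]*)</;

print "[$from_dollar_one] [$from_list]\n";
```

Only use $1 after checking that the match succeeded; otherwise it may still hold the value from an earlier successful match.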

Note that perl parses your regex very differently from how you probably expect. Here is the output of -MO=Deparse on it, modified to use m// instead of // so the regexes stand out:

m/>\s+\w*</ | m/>\w*</ | m/>\w*</s + m//

I doubt this is the way you think it parses.

However, besides the fact that it does not compile with those delimiters, your regex needs work to match what you document it as matching. A straightforward translation of your specification would be:

if (/>(\s*[[:alnum:]]*)</) {
    my $matched = $1;
    # ...
} else {
    # didn't match
}

(Note that \w matches more than just alphanumerics, since it also includes _, so I did not use it there. I also suspect you defined what you want to match incorrectly. update: I also left out the 0 or more spaces after the "<", because that part always succeeds by finding at least 0 spaces and so adds nothing to the match.)
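
To illustrate both points, a quick check. The variable names and the 'a<b' test string are made up:

```perl
use strict;
use warnings;

# \w accepts the underscore; the POSIX class [[:alnum:]] does not.
my $underscore_is_word  = ('_' =~ /^\w$/)          ? 1 : 0;
my $underscore_is_alnum = ('_' =~ /^[[:alnum:]]$/) ? 1 : 0;

# A trailing \s* is satisfied by zero spaces, so it never rejects a match.
my $with_trailing    = ('a<b' =~ /<\s*/) ? 1 : 0;
my $without_trailing = ('a<b' =~ /</)    ? 1 : 0;

print "\\w matches _:           $underscore_is_word\n";
print "[[:alnum:]] matches _:  $underscore_is_alnum\n";
print "with \\s*: $with_trailing, without: $without_trailing\n";
```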

(update: minor rephrasing to make things make more sense.)
