lovely regexs

sv123 has asked for the wisdom of the Perl Monks concerning the following question:

So I have some HTML I need to extract the IMG from. An example would be something like:

<p><a href='/blah.html' target='_self' style='text-decoration: none;'><img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" onmouseout='endm("etc"); this.src="http://unimportant.jpg";' onmouseover='startm("etc","http://etc",".jpg");'  border=0></a><br>some text</p>

It's the text inside the first "src=" I'm interested in. I tried something like:

$moo =~ m/src="(.*)"/;

and variations of that with no luck. Since I have the most luck here asking, well... Here I am! Thanks (yet again!) in advance (:


Replies are listed 'Best First'.
Re: lovely regexs by repellent (Priest) on Apr 11, 2009 at 22:38 UTC
Avoid using regular expressions on HTML. HTML::TokeParser::Simple can do the job for you. Hints: `is_start_tag("img")` `get_attr("src")`
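For what it's worth, those two hints might fit together roughly like this. It is only a minimal sketch, assuming HTML::TokeParser::Simple is installed from CPAN and the markup is already in a string; `$html`, `$parser` and `@srcs` are names made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# The sample markup from the question, joined onto one line (assumed input).
my $html = q{<p><a href='/blah.html' target='_self' style='text-decoration: none;'><img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" onmouseout='endm("etc"); this.src="http://unimportant.jpg";' onmouseover='startm("etc","http://etc",".jpg");'  border=0></a><br>some text</p>};

my $parser = HTML::TokeParser::Simple->new( string => $html );

# Walk the token stream and keep the src attribute of every <img> start tag.
my @srcs;
while ( my $token = $parser->get_token ) {
    next unless $token->is_start_tag('img');
    my $src = $token->get_attr('src');
    push @srcs, $src if defined $src;
}

print "$_\n" for @srcs;
```

The `this.src=...` text inside the `onmouseout` handler is just part of another attribute's value as far as the parser is concerned, so it should not be picked up.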
Re: lovely regexs by gmargo (Hermit) on Apr 12, 2009 at 00:41 UTC
Perhaps change the regular expression to turn off the default "greediness" of the "*" with a "?" quantifier, so that it gathers only up to the next quote character: `$moo =~ m/src="(.*?)"/;` However, I normally use HTML::TreeBuilder to parse and search HTML.
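A rough idea of what the HTML::TreeBuilder route could look like, assuming the module is installed from CPAN. This is a sketch rather than the poster's actual code; `$html`, `$tree`, `$img` and `$src` are invented names.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The sample markup from the question, joined onto one line (assumed input).
my $html = q{<p><a href='/blah.html' target='_self' style='text-decoration: none;'><img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" onmouseout='endm("etc"); this.src="http://unimportant.jpg";' onmouseover='startm("etc","http://etc",".jpg");'  border=0></a><br>some text</p>};

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first element that matches.
my $img = $tree->look_down( _tag => 'img' );
my $src = $img ? $img->attr('src') : undef;

print "$src\n" if defined $src;

# Free the tree explicitly once done with it.
$tree->delete;
```

Calling `look_down` in list context instead would return every `img` element, should the page have more than one.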
Re^2: lovely regexs by dsheroh (Monsignor) on Apr 12, 2009 at 22:20 UTC
Ignoring, for the moment, the wisdom of using the proper tool (which is generally not a regex) for parsing HTML... The issue here is not greediness. The issue is the misuse of ".". Making the "*" non-greedy is just a band-aid which masks the fact that ".*" says "match any number of *any* characters", when what you actually mean is "match any number of any non-double-quote characters". The correct way to write that regex is: `$moo =~ m/src="([^"]*)"/;` The non-greedy qualifier does have its legitimate uses, generally in cases where your target is terminated by a sequence of multiple characters. In cases where a negated character class can do the job, though, the character class will almost always be the better option.
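To make the difference concrete, here is a small comparison run against the sample from the question. It is only an illustration: the variable names and the extra ` width` suffix in the last two patterns are made up to show where the non-greedy form can still go wrong.

```perl
use strict;
use warnings;

# The sample markup from the question, joined onto one line.
my $moo = q{<p><a href='/blah.html' target='_self' style='text-decoration: none;'><img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" onmouseout='endm("etc"); this.src="http://unimportant.jpg";' onmouseover='startm("etc","http://etc",".jpg");'  border=0></a><br>some text</p>};

# Greedy: .* runs to the end of the string, then backs up to the LAST double quote.
my ($greedy) = $moo =~ m/src="(.*)"/;

# Non-greedy: stops at the first double quote it can.
my ($lazy) = $moo =~ m/src="(.*?)"/;

# Negated character class: cannot cross a double quote at all.
my ($negated) = $moo =~ m/src="([^"]*)"/;

# Once more pattern follows the capture, .*? will happily cross quotes to
# make the whole match succeed, while [^"]* simply fails to match.
my ($lazy_width)    = $moo =~ m/src="(.*?)" width/;
my ($negated_width) = $moo =~ m/src="([^"]*)" width/;

print "greedy:        $greedy\n";
print "lazy:          $lazy\n";
print "negated:       $negated\n";
print "lazy+width:    ", $lazy_width    // '(no match)', "\n";
print "negated+width: ", $negated_width // '(no match)', "\n";
```

On this input the plain non-greedy and negated-class patterns capture the same URL. The last pair is where they part ways: the non-greedy capture swallows `" height="232` along with the URL, whereas the character class version refuses to match.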
Re: lovely regexs by CountZero (Bishop) on Apr 12, 2009 at 12:55 UTC
Or use HTML::SimpleLinkExtor's ~~`$extor->src`~~ `$extor->img` method. Update: changed `src` to `img`. Thanks to Cody Pendant.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re: lovely regexs by Cody Pendant (Prior) on Apr 12, 2009 at 23:19 UTC
Just in case this hasn't occurred to you, there are other things in HTML which have a `src` attribute.

Nobody says perl looks like line-noise any more
kids today don't know what line-noise IS ...
Re^2: lovely regexs by CountZero (Bishop) on Apr 13, 2009 at 12:54 UTC
Indeed, and the `$extor->src` method returns them all. Fortunately there is another method, `$extor->img`, that returns the `src` attribute of the `img` links only.

use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new();
my $url = 'http://www.perlmonks.org';
$extor->parse_url($url);
foreach my $src ( $extor->img ) {
    print "$src\n";
}

Output:

http://promote.pair.com/i/pair-banner-current.gif
http://perlmonks.org/images/monkpics/pater_hat_sm.gif

CountZero

