PerlMonks  

lovely regexs

by sv123 (Initiate)
on Apr 11, 2009 at 21:30 UTC

sv123 has asked for the wisdom of the Perl Monks concerning the following question:

So I have some HTML I need to extract the IMG from. An example would be something like:
<p><a href='/blah.html' target='_self' style='text-decoration: none;'><img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" onmouseout='endm("etc"); this.src="http://unimportant.jpg";' onmouseover='startm("etc","http://etc",".jpg");' border=0></a><br>some text</p>
It's the text inside the first "src=" I'm interested in. I tried something like:
$moo =~ m/src="(.*)"/;
and variations of that with no luck. Since I have the most luck asking here, well... here I am! Thanks (yet again!) in advance (:

Replies are listed 'Best First'.
Re: lovely regexs
by repellent (Priest) on Apr 11, 2009 at 22:38 UTC
    Avoid using regular expressions on HTML.

    HTML::TokeParser::Simple can do the job for you. Hints:
    • is_start_tag("img")
    • get_attr("src")
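    A minimal sketch of how those two hints might fit together, assuming HTML::TokeParser::Simple is installed from CPAN. The first_img_src helper is a made-up name for illustration; the sample markup is the one from the question.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: return the src of the first <img> start tag, or nothing.
sub first_img_src {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('img');
        return $token->get_attr('src');
    }
    return;
}

# The snippet from the original question.
my $html = <<'HTML';
<p><a href='/blah.html' target='_self' style='text-decoration: none;'><img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" onmouseout='endm("etc"); this.src="http://unimportant.jpg";' onmouseover='startm("etc","http://etc",".jpg");' border=0></a><br>some text</p>
HTML

my $src = first_img_src($html);
print defined $src ? "$src\n" : "no img found\n";
```

    Because the parser works on tags and attributes rather than raw characters, the this.src="..." buried in the onmouseout handler should not be mistaken for the image's own src.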
Re: lovely regexs
by gmargo (Hermit) on Apr 12, 2009 at 00:41 UTC
    Perhaps change the regular expression to turn off the default "greediness" of the "*" by putting a "?" after it, so that it matches only up to the next quote character.
    $moo =~ m/src="(.*?)"/;
    However, I normally use HTML::TreeBuilder to parse and search html.
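    To see what greediness does here, a small sketch using a shortened version of the tag from the question (the trimmed $moo string is an assumption made to keep the output readable):

```perl
use strict;
use warnings;

# Shortened form of the <img> tag from the question.
my $moo = q{<img id="blah.flv" src="http://WHAT-I-NEED-TO-GR.AB" height="232" width="308" border=0>};

# Greedy: ".*" runs to the LAST double quote on the line.
my ($greedy) = $moo =~ m/src="(.*)"/;

# Non-greedy: ".*?" stops at the first double quote that lets the match succeed.
my ($lazy) = $moo =~ m/src="(.*?)"/;

print "greedy: $greedy\n";   # http://WHAT-I-NEED-TO-GR.AB" height="232" width="308
print "lazy:   $lazy\n";     # http://WHAT-I-NEED-TO-GR.AB
```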
      Ignoring, for the moment, the wisdom of using the proper tool (which is generally not a regex) for parsing HTML...

      The issue here is not greediness. The issue is the misuse of ".*". Making the "*" non-greedy is just a band-aid which masks the fact that ".*" says "match any number of any characters", when what you actually mean is "match any number of any characters other than a double quote". The correct way to write that regex is:

      $moo =~ m/src="([^"]*)"/;

      The non-greedy qualifier does have its legitimate uses, generally in cases where your target is terminated by a sequence of multiple characters. In cases where a negated character class can do the job, though, the character class will almost always be the better option.
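      A sketch of where the two approaches part ways, under made-up input: once more pattern follows the closing quote, ".*?" is free to run past a quote to make the rest match, while the negated class cannot. The last lines show a multi-character terminator, the kind of case where non-greedy matching is the natural fit.

```perl
use strict;
use warnings;

# Made-up input: two images, only the second has a height attribute.
my $html = q{<img src="a.jpg" alt="x"><img src="b.jpg" height="1">};

# Non-greedy ".*?" starts at the first src and expands across quotes
# until '" height' finally follows.
my ($lazy) = $html =~ m/src="(.*?)" height/;

# The negated class can never cross a double quote, so the first src
# fails outright and the second one matches.
my ($class) = $html =~ m/src="([^"]*)" height/;

print "lazy:  $lazy\n";    # a.jpg" alt="x"><img src="b.jpg
print "class: $class\n";   # b.jpg

# Multi-character terminator ("-->"): a single negated class cannot
# express "up to this sequence", so non-greedy is the simpler tool.
my ($comment) = q{<!-- one --><!-- two -->} =~ m/<!--(.*?)-->/;
print "comment: [$comment]\n";   # [ one ]
```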

Re: lovely regexs
by CountZero (Bishop) on Apr 12, 2009 at 12:55 UTC
    Or use HTML::SimpleLinkExtor's $extor->img method.

    Update: changed src to img. Thanks to Cody Pendant

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: lovely regexs
by Cody Pendant (Prior) on Apr 12, 2009 at 23:19 UTC
    Just in case this hasn't occurred to you, there are other things in HTML which have a src attribute.


    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
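    A small sketch of that pitfall, using a made-up page fragment in which a <script> with its own src comes before the image. The tag-restricted regex is only an illustration and assumes no attribute value contains a literal ">"; a real HTML parser avoids that assumption.

```perl
use strict;
use warnings;

# Made-up fragment: the <script> src appears before the <img> src.
my $page = q{<script src="http://example.com/menu.js"></script><p><img id="blah.flv" src="http://example.com/thumb.jpg" border=0></p>};

# Any src attribute: the script wins because it comes first.
my ($any_src) = $page =~ m/src="([^"]*)"/;

# Only a src inside an <img ...> tag. [^>]*? keeps the match within one
# tag and picks the first whitespace-preceded src attribute in it.
my ($img_src) = $page =~ m/<img\b[^>]*?\ssrc="([^"]*)"/i;

print "any src: $any_src\n";   # http://example.com/menu.js
print "img src: $img_src\n";   # http://example.com/thumb.jpg
```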
      Indeed, and the $extor->src method returns them all. Fortunately, there is another method, $extor->img, that returns the src attribute of the img tags only.
      use strict;
      use warnings;
      use HTML::SimpleLinkExtor;

      my $extor = HTML::SimpleLinkExtor->new();
      my $url   = 'http://www.perlmonks.org';
      $extor->parse_url($url);
      foreach my $src ( $extor->img ) {
          print "$src\n";
      }
      Output:
      http://promote.pair.com/i/pair-banner-current.gif
      http://perlmonks.org/images/monkpics/pater_hat_sm.gif

