http://qs321.pair.com?node_id=33620

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How can I find the links in HTML tags?

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: How can I find the links in HTML tags?
by dchetlin (Friar) on Sep 24, 2000 at 22:44 UTC

    The following 4-liner uses HTML::Parser; it was both easy to code and correct — unlike any regex solution you're likely to see.

    As coded, it assumes you're looking only for href links within anchor tags, but it is easily modified for other things, such as img tags.

    use HTML::Parser;
    my $p = HTML::Parser->new( api_version => 3 );
    $p->handler( start => sub { print shift->{href} if shift eq 'a' }, 'tagname,attr' );
    local $/; $p->parse(<>); # Just another URI finder
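    The modification mentioned above (img tags as well as anchors) might look something like the sketch below. It assumes HTML::Parser is installed from CPAN; the %link_attr table, the @found array, and the sample HTML string are made up for illustration and are not part of the original post.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical table: which attribute holds the link for each tag.
my %link_attr = ( a => 'href', img => 'src' );
my @found;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ( $tagname, $attr ) = @_;
            # Skip tags we are not interested in.
            my $name = $link_attr{$tagname} or return;
            # Skip anchors without href (e.g. <a name="...">).
            push @found, $attr->{$name} if defined $attr->{$name};
        },
        'tagname,attr',
    ],
);

# Made-up input, including the tricky anchors from the test document
# discussed further down in this thread.
$p->parse('<a name="foo>bar" href="foobar"></a>'
        . '<img src="pic.png" alt="x">'
        . '<a name="href=foo"></a>');
$p->eof;

print "$_\n" for @found;
```

    Collecting into an array instead of printing directly makes the result easier to check, and the defined test avoids warnings for anchors that have no href at all.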
Re: How can I find the links in HTML tags?
by merlyn (Sage) on Sep 22, 2000 at 13:03 UTC
Re: How can I find the links in HTML tags?
by Dermot (Scribe) on Sep 24, 2000 at 20:54 UTC
    As merlyn suggested, reusing an existing CPAN module is probably the best way to go. However, if you just want to see how it's done, and presuming you want to see all links, here is how Troc does it in his collate module. This doesn't handle links in JavaScript, but that's a difficult problem, since links in JavaScript can be the value of expressions.
    open(DOC, "<$doc_filename") || die "can't open doc $doc_filename: $!";
    binmode(DOC);
    $document = <DOC>;
    close(DOC);
    while ($document =~ m{< ?(.*?) ?>}g) {
        $tag = $1;
        # find tags with links
        $link = ($tag =~ m{ (background|src|usemap|action|href)\s?=\s? (['"]*) ([^\2 ]+)\2 }xi)[2];
        next unless (defined $link);
      Well, I hate to be the bad guy, but I think a comment is needed here. There are several problems with the above code:

      1. It doesn't work, even as designed (it's only getting the first line from the file, because <DOC> is in scalar context)
      2. It's bad style
        • Aside from the error in reading the file, the binmode is dubious
        • Assuming the intent is really to slurp the entire file, that's generally not a good thing to do -- and that's not a good idiom to use to do it
        • Use of .*? in a REx is almost always a bad choice -- here it should be [^>]*
      3. It's shockingly inefficient -- if you're up for some fun, run it through re 'debug' some time
      4. Finally, and most importantly, it's wrong. I added a print statement and closed the while (and fixed the slurping error) and ran the following perfectly valid HTML through it:
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
        <HTML>
        <HEAD><TITLE>Test</TITLE></HEAD>
        <BODY>
        <a href ="foobaz"></a>
        <a name="href=foo"></a>
        <a name="foo>bar" href="foobar"></a>
        </BODY>
        </HTML>
        And it missed the two valid href links (the first and the last) and wrongly flagged the second one as an href link.
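      Points 1 and 2 in the list above can be checked with core Perl alone. The following is only a sketch: the sample document in $html is made up, it reads from an in-memory filehandle instead of a real file, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Made-up document with one tag split across two lines.
my $html = "<HTML>\n<BODY>\n<a\n  href=\"foobar\"></a>\n</BODY>\n</HTML>\n";

# Point 1: <$fh> in scalar context returns one record (one line by default).
open(my $fh, '<', \$html) or die "can't open in-memory file: $!";
my $first_line = <$fh>;
close($fh);

# Slurping: undefine $/ locally so the next read returns everything.
open($fh, '<', \$html) or die "can't open in-memory file: $!";
my $whole = do { local $/; <$fh> };
close($fh);

# Point 2: '.' does not match a newline by default, so .*? cannot span a
# tag that is split across lines, while [^>]* can.
my @lazy_dot  = $whole =~ m{< ?(.*?) ?>}g;
my @neg_class = $whole =~ m{<([^>]*)>}g;

printf "first read: %d bytes, slurp: %d bytes\n",
    length($first_line), length($whole);
printf ".*? found %d tags, [^>]* found %d tags\n",
    scalar(@lazy_dot), scalar(@neg_class);
```

      Here the scalar read stops after the first line, and the .*? pattern silently skips the anchor that spans two lines. Neither pattern copes with a > inside a quoted attribute value, which is one more reason to prefer a real parser.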

      I think it's very important to realize that while doing this sort of thing seems easy, it isn't. There are a lot of cases that you will miss if you try. Use LinkExtor. Use HTML::Parser. Use HTML::TokeParser. Heck, use URI::Find. Just don't "do it yourself" unless you're prepared to devote quite a bit of time developing, honing, and fixing your solution.

      I think that Dermot knows all of this, judging from his comment that using the CPAN is probably the best way to go, but I wanted to make sure no one decided to use this instead because it was "easier". Do not.

      I leave you with something I posted to Usenet not too long ago -- a script that correctly finds all anchor links in a document -- in 4 lines of Perl. It's that easy to do with HTML::Parser or one of the other tools made for such things.

      -dlc

      #!/usr/bin/perl -wl
      use strict;use HTML::Parser;my $p=HTML::Parser->new(api_version
      =>3);$p->handler(start=>sub{print shift->{href}if shift eq 'a'},
      'tagname,attr');local $/;$p->parse(<>);#Just another URI finder
      Bad guy, no. I don't offend that easily and your comments are in the main valid. I'm a bit embarrassed that my first submission to PM went down in flames :) I have a couple of questions on the comments though:
      
          1. The code referred to uses 'undef $/' to facilitate
             the slurp. I should have included this in the example.
             I take it only fully valid running code should be 
             posted here. I'll do that in future.
      
          2. Why is binmode dubious? I agree that [^>]* is a much
             better choice for the regex.
      
          3. Showing my ignorance. What is re 'debug'?
      
      
      I totally agree that parsing HTML is tricky and using the modules available is definitely the way to go. I learnt that while using the above code to parse some HTML. My apologies if I didn't make that clear enough.
Re: How can I find the links in HTML tags?
by hawson (Monk) on Sep 22, 2000 at 19:00 UTC
    @links = $html =~ /<a\s+href="([^"]+)"/gi;

    Of course, this isn't robust and is easy to fool. But it'll work in a lot of cases, especially if you have control over the quality of the HTML input.
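    To see how easily the pattern is fooled, here is a small sketch that runs it over a few made-up anchors. The sample HTML is invented for illustration and is not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample input, one anchor per line.
my $html = join "\n",
    '<a href="one.html">one</a>',
    '<a class="nav" href="two.html">two</a>',
    q{<a href='three.html'>three</a>},
    '<A HREF = "four.html">four</A>',
    '<!-- <a href="five.html">commented out</a> -->';

# The one-line regex from the post above.
my @links = $html =~ /<a\s+href="([^"]+)"/gi;
print "$_\n" for @links;
```

    On this input it reports only one.html and five.html: it misses the anchor with another attribute before href, the single-quoted value, and the one with spaces around the equals sign, and it picks up a link that is commented out.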