The following 4-liner uses HTML::Parser; it was both easy to code and correct — unlike any regex solution you're likely to see.
As coded, it assumes you're looking only for href links within anchor tags, but it is easily modified for other things, such as img tags.
use HTML::Parser;
my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => sub { print shift->{href} if shift eq 'a' }, 'tagname,attr' );
local $/;
$p->parse(<>);
#Just another URI finder
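For anyone wondering what "easily modified" might look like, here is a minimal sketch that collects img src values alongside anchor hrefs. The helper name extract_links and the tag-to-attribute map are illustrative assumptions, not part of the original snippet; only the HTML::Parser calls (new, start_h, parse, eof) come from the module itself.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect one attribute value per wanted tag.
# %$wanted maps tag name => attribute name, e.g. { a => 'href', img => 'src' }.
sub extract_links {
    my ($html, $wanted) = @_;
    my @found;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                my $name = $wanted->{$tagname} or return;    # tag not wanted
                push @found, $attr->{$name} if defined $attr->{$name};
            },
            'tagname,attr',
        ],
    );
    $p->parse($html);
    $p->eof;    # signal end of document so buffered input is processed
    return @found;
}

my $html = '<a href="page.html">x</a> <img src="pic.png" alt="p"> <a name="top"></a>';
print "$_\n" for extract_links($html, { a => 'href', img => 'src' });
```

The anchor that has only a name attribute is skipped rather than printed as an undefined value.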
As contributed by merlyn, reusing the available
CPAN module is probably the best way to go.
However, if you just want to see how it's done and
presuming that you would like to see all links
then here is how Troc does it with his
collate module. This doesn't handle links in
Javascript but that's a difficult problem as links
in Javascript can be the value of expressions.
open(DOC, "<$doc_filename") || die "can't open doc $doc_filename: $!";
binmode(DOC);
$document = <DOC>;
close(DOC);
while ($document =~ m{< ?(.*?) ?>}g) {
$tag = $1;
# find tags with links
$link =
($tag =~ m{
(background|src|usemap|action|href)\s?=\s?
(['"]*)
([^\2 ]+)\2
}xi)[2];
next unless (defined $link);
Well, I hate to be the bad guy, but I think a comment is needed here. There are several problems with the above code:
- It doesn't work, even as designed (it's only getting the first line from the file, because <DOC> is in scalar context)
- It's bad style
- Aside from the error in reading the file, the binmode is dubious
- Assuming the intent is really to slurp the entire file, that's generally not a good thing to do -- and that's not a good idiom to use to do it
- Use of .*? in a REx is almost always a bad choice -- here it should be [^>]*
- It's shockingly inefficient -- if you're up for some fun, run it through re 'debug' some time
- Finally, and most importantly, it's wrong. I added a print statement and closed the while (and fixed the slurping error) and ran the following perfectly valid HTML through it:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD><TITLE>Test</TITLE></HEAD>
<BODY>
<a href ="foobaz"></a>
<a name="href=foo"></a>
<a name="foo>bar" href="foobar"></a>
</BODY>
</HTML>
And it missed the two valid href links (the first and the last) and wrongly flagged the second one as an href link.
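To illustrate why the last anchor gets lost, here is a small sketch (first_tag is a made-up helper, not from the code under discussion). Any pattern that treats the first > as the end of the tag stops inside the quoted name value, so the href is never seen. Note that [^>]* is the cheaper pattern, but it has the same blind spot as .*? here.

```perl
use strict;
use warnings;

# The third anchor from the test document.
my $html = '<a name="foo>bar" href="foobar"></a>';

# Hypothetical helper: return what a "<...>" pattern captures for the first tag.
sub first_tag {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

my $lazy    = first_tag($html, qr{<(.*?)>});
my $negated = first_tag($html, qr{<([^>]*)>});

# Both captures end at the '>' inside the quoted attribute value.
print "lazy:    $lazy\n";
print "negated: $negated\n";
```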
I think it's very important to realize that while doing this sort of thing seems easy, it isn't. There are a lot of cases that you will miss if you try. Use LinkExtor. Use HTML::Parser. Use HTML::TokeParser. Heck, use URI::Find. Just don't "do it yourself" unless you're prepared to devote quite a bit of time developing, honing, and fixing your solution.
I think that Dermot knows all of this, judging from his comment that using the CPAN is probably the best way to go, but I wanted to make sure no one decided to use this instead because it was "easier". Do not.
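On the slurping point: if reading the whole file really is what you want, one common idiom localizes $/ inside a do block so the change cannot leak into the rest of the program. This is only a sketch; the slurp helper and its name are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into one string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "can't open $path: $!";
    # local undefines $/ for this block only; it is restored on exit.
    my $content = do { local $/; <$fh> };
    close($fh);
    return $content;
}

# Usage: read this script's own source, if it is running from a file.
print length(slurp($0)), " characters read\n" if -f $0;
```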
I leave you with something I posted to Usenet not too long ago -- a script that correctly finds all anchor links in a document -- in 4 lines of Perl. It's that easy to do with HTML::Parser or one of the other tools made for such things.
-dlc
#!/usr/bin/perl -wl
use strict;use HTML::Parser;my $p=HTML::Parser->new(api_version
=>3);$p->handler(start=>sub{print shift->{href}if shift eq 'a'},
'tagname,attr');local $/;$p->parse(<>);#Just another URI finder
Bad guy, no. I don't offend that easily and your comments
are in the main valid. I'm a bit embarrassed that my first
submission to PM went down in flames :) I have a couple of
questions on the comments though:
1. The code referred to uses 'undef $/' to facilitate
the slurp. I should have included this in the example.
I take it only fully valid running code should be
posted here. I'll do that in future.
2. Why is binmode dubious? I agree that [^>]* is a much
better choice for the regex.
3. Showing my ignorance. What is re 'debug'?
I totally agree that parsing HTML is tricky and using the
modules available is definitely the way to go. I learnt
that while using the above code to parse some HTML. My
apologies if I didn't make that clear enough.
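For reference on the third question: re is a core pragma, and use re 'debug' makes the regex engine print how each pattern is compiled and executed (to STDERR). A minimal sketch follows; the sample string and pattern are made up for illustration.

```perl
use strict;
use warnings;

my $tag = 'a name="top" href="index.html"';

my $href;
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    ($href) = $tag =~ /href\s*=\s*"([^"]*)"/;
}

print "matched: $href\n";
```

The trace shows the compiled program and every backtracking step, which is how the cost of a pattern such as .*? becomes visible.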
@links = $html =~ /<a\s+href="([^"]+)"/gi;
Of course, this isn't robust and is easy to fool.
But it'll work in a lot of cases, especially if you have control over the quality of the HTML input.
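A quick sketch of how easily it is fooled, using made-up sample links: single quotes, an attribute in front of href, or an unquoted value are all enough to hide a link from this pattern.

```perl
use strict;
use warnings;

my $html = join "\n",
    '<a href="one.html">plain</a>',
    "<a href='two.html'>single quotes</a>",
    '<a class="nav" href="three.html">attribute before href</a>',
    '<a href=four.html>unquoted</a>';

my @links = $html =~ /<a\s+href="([^"]+)"/gi;

# Only the first of the four links is found.
print "$_\n" for @links;
```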