Link extractor

by marcelo.magallon (Acolyte)
on Nov 17, 2004 at 22:01 UTC (#408617, sourcecode)
Category: Web Stuff
Author/Contact Info Marcelo Magallon
Description: Small script to extract all links from a given URL or local HTML file, optionally keeping only those that match a regex. I whipped it up real quick some time ago, and I use it for extracting links from different webpages (e.g. wget $(lnkxtor URL '\.tar\.gz$') and the like). Pass -i to extract image sources instead of links. Comments about further development directions and improvements are very welcome!

use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::Simple;
use URI;
use Getopt::Std;

my %opts;

# -i: extract <img src="..."> instead of <a href="...">
getopts('i', \%opts);

my ($tag, $href) = exists $opts{i} ? ('img', 'src') : ('a', 'href');

if (@ARGV < 1 or @ARGV > 2) {
    die "Invalid number of arguments\nUsage: $0 [-i] URL [REGEX]\n";
}

my ($url, $regex) = @ARGV;
my $uri = URI->new($url);
my $tree;

$regex ||= '.';

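# A local file is parsed directly; anything else is fetched over the network.
if (-f $url) {
    $tree = HTML::TreeBuilder->new_from_file($url);
}
else {
    my $content = get($uri);
    die "Could not fetch $url\n" unless defined $content;
    $tree = HTML::TreeBuilder->new_from_content($content);
}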

die unless defined $tree;

# Print every matching attribute value as an absolute URL, one per line.
foreach my $link ($tree->look_down(_tag => $tag, $href => qr{$regex})) {
    my $link_url = URI->new_abs($link->attr($href), $uri);
    print $link_url->as_string, "\n";
}
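To try the matching and URL-resolution steps without touching the network, here is a minimal sketch that runs the same look_down / URI->new_abs combination against an inline HTML string. The extract_links helper, the sample markup and the example.org base URL are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical helper (not in the original script): returns absolute URLs
# for every $tag element whose $attr value matches $regex, resolved
# against $base.
sub extract_links {
    my ($html, $base, $tag, $attr, $regex) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @urls;
    foreach my $elem ($tree->look_down(_tag => $tag, $attr => qr{$regex})) {
        push @urls, URI->new_abs($elem->attr($attr), $base)->as_string;
    }
    $tree->delete;    # release the parse tree
    return @urls;
}

# Made-up sample page
my $html = <<'HTML';
<html><body>
<a href="foo-1.0.tar.gz">source</a>
<a href="/docs/index.html">docs</a>
<a href="http://mirror.example.org/bar-2.1.tar.gz">mirror</a>
<img src="logo.png">
</body></html>
HTML

my $base = 'http://example.org/pub/';

# Default mode: anchors whose href ends in .tar.gz
my @links = extract_links($html, $base, 'a', 'href', '\.tar\.gz$');
print "$_\n" for @links;

# Equivalent of the -i switch: image sources, any value
my @images = extract_links($html, $base, 'img', 'src', '.');
print "$_\n" for @images;
```

Relative values are resolved against the base, while values that are already absolute pass through unchanged, which is why the output can be handed straight to wget.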
Replies are listed 'Best First'.
Re: Link extractor
by Limbic~Region (Chancellor) on Nov 18, 2004 at 00:46 UTC
