http://qs321.pair.com?node_id=592650
Category: Web stuff
Author/Contact Info Scott Peterson peterson146@gmail.com
Description: This code fetches a web page that you specify at the command line and extracts all links from that page into a plain ASCII text list (redirect the output to a file). Comments and suggestions for improvement are welcome! Updated per suggestions by ikegami and merlyn.
use strict;
use WWW::Mechanize;

# usage: perl linkextractor.pl http://www.example.com/ > output.txt
my $url  = shift;
my $mech = WWW::Mechanize->new();
$mech->get($url);

my $status = $mech->status();
print $status . " OK-URL request succeeded." . "\n";

my @links = $mech->links;
print STDOUT ($_->url, $/) foreach @links;
Re: Simple link extraction tool
by ikegami (Patriarch) on Jan 02, 2007 at 21:43 UTC

    How am I supposed to use your program?

    • This clobbers any existing listurls.txt, gives me two copies of the data and puts a useless status message in preferedname.txt:

      linkextractor http://www.blah.com/ > preferedname.txt
    • This clobbers any existing listurls.txt and puts a useless status message in preferedname.txt:

      linkextractor http://www.blah.com/ > preferedname.txt & del listurls.txt
    • This clobbers any existing listurls.txt and loses any error status message:

      linkextractor http://www.example.com/ > nul & move listurls.txt preferedname.txt

    Suggestions:

    • Don't say it's OK when it isn't. Use the correct message.
    • Don't say it's OK when it is. Only send the URIs to STDOUT.
    • Send error messages (incl. non-200 status messages) to STDERR.
    • Convert the URIs to absolute URIs.
    • Remove duplicate URIs.
    • Replace my $url = <@ARGV>; with my ($url) = @ARGV;.
    • The domain www.example.com (among others) was set aside for examples. It's better to use that than www.blah.com, a real live domain.

    Suggestions applied:

    use strict;
    use warnings;

    use List::MoreUtils qw( uniq );
    use WWW::Mechanize  qw( );

    # usage: linkextractor http://www.blah.com/ > listurls.txt

    my ($url) = @ARGV;

    my $mech = WWW::Mechanize->new();
    my $response = $mech->get($url);
    $response->is_success()
        or die($response->status_line() . "\n");

    print
        map  { "$_\n" }
        sort { $a cmp $b }
        uniq
        map  { $_->url_abs() }
        $mech->links();

    Update: At first, I didn't realize it was outputting to STDOUT in addition to listurls.txt. I had recommended that the output be sent to STDOUT. This is a rewrite.

      Thanks for taking the time to educate me and produce working code per your suggestions. Prior to posting my code, what I found with Super Search was that queries about the existence of code like this simply got referred to CPAN modules, which was mildly surprising, as many SoPW posts get responses with code snippets that solve their problem.

      I later found brian d foy's Re: Creating a web crawler (theory), which points to his webreaper, a tool apparently designed to download entire websites.

        One of the things you want to do when previewing a post is check that all your links go where you meant them to go. If you had done this, you would have found that your "webreaper" link doesn't work. You could have even simply copied the link from the source node: webreaper.

        Instead, you (apparently) wrote [cpan://dist/webreaper/]. ++ for a good guess, but it's wrong. The PerlMonks way to link efficiently to a distribution on CPAN is with [dist://webreaper] (⇒ webreaper). This is documented at What shortcuts can I use for linking to other information?

        Moral: Verify your links when you post.

Re: Simple link extraction tool
by merlyn (Sage) on Jan 02, 2007 at 21:53 UTC
Re: Simple link extraction tool
by arkturuz (Curate) on Jan 03, 2007 at 12:47 UTC
    I always check user input for (non)existence, so I added this simple check:
    my $url = shift;
    if (!$url) {
        print "Usage: $0 URI\n";
        exit 1;
    }
    Also, if the user enters a URI without 'http', add it:
    if ($url !~ m(^http://)i) {
        $url = 'http://' . $url;
    }
    Checking and validating user input is (for me) a must-have task, even for a simple and small program like this.
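    Folding those two checks into the WWW::Mechanize-based extractor might look something like the sketch below; the usage message and the http:// default come straight from the fragments above, while the rest of the structure is only illustrative:

        use strict;
        use warnings;
        use WWW::Mechanize;

        # usage: perl linkextractor.pl example.com > output.txt
        my $url = shift;
        if (!$url) {
            print "Usage: $0 URI\n";
            exit 1;
        }

        # Default to http:// if the user left the scheme off.
        if ($url !~ m(^http://)i) {
            $url = 'http://' . $url;
        }

        my $mech     = WWW::Mechanize->new();
        my $response = $mech->get($url);
        $response->is_success()
            or die($response->status_line() . "\n");

        # Print each link as an absolute URI, one per line.
        print $_->url_abs(), "\n" foreach $mech->links();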
Re: Simple link extraction tool
by Scott7477 (Chaplain) on Jan 03, 2007 at 22:09 UTC
Re: Simple link extraction tool
by davidrw (Prior) on Jan 03, 2007 at 18:00 UTC
    a lynx/perl solution:
    lynx --dump http://www.example.com | perl -0777 -pe 's/.+^References[\r\n]+//sm'

    # or, to also strip the numbers:
    lynx --dump http://www.example.com | perl -0777 -pe 's/.+^References[\r\n]+//sm; s/^\s*\d+\. //mg'
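    For anyone who finds the one-liner hard to read, here is a rough standalone sketch of the same idea; it assumes lynx is on the PATH and that its --dump output ends with the usual numbered "References" section:

        use strict;
        use warnings;

        my $url = shift or die "Usage: $0 URL\n";

        # Let lynx render the page; --dump appends a numbered "References"
        # section listing every link found on the page.
        my $dump = `lynx --dump $url`;

        # Throw away everything up to and including the "References" heading ...
        $dump =~ s/.+^References[\r\n]+//sm;

        # ... and strip the leading " 1. ", " 2. ", ... numbering.
        $dump =~ s/^\s*\d+\. //mg;

        print $dump;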
Simple link extraction tool-another way
by Scott7477 (Chaplain) on Jan 04, 2007 at 00:27 UTC
    After consulting with merlyn and brian d foy, I came up with this:
    use strict;
    use HTML::SimpleLinkExtor;
    use LWP::Simple;

    # usage: linkextractor http://www.example.com > output.txt

    my $url     = shift;
    my $content = get($url);

    my $extor = HTML::SimpleLinkExtor->new();
    $extor->parse($content);

    my @all_links = $extor->links;
    foreach my $elem (@all_links) {
        print $elem . "\n";
    }
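    If I remember the module's interface correctly, HTML::SimpleLinkExtor can also restrict the result to particular tags, and passing a base URL to new() should resolve relative links; a hedged sketch:

        use strict;
        use warnings;
        use HTML::SimpleLinkExtor;
        use LWP::Simple;

        my $url     = shift or die "Usage: $0 URL\n";
        my $content = get($url) or die "Couldn't fetch $url\n";

        # Passing a base URL to new() should make relative links come back
        # absolute (assumption based on my reading of the docs).
        my $extor = HTML::SimpleLinkExtor->new($url);
        $extor->parse($content);

        # links() returns every link-ish attribute (href, src, ...);
        # a() should return just the <a href="..."> targets.
        print "$_\n" foreach $extor->a;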
    Update: HTML::SimpleLinkExtor comes with a script, linktractor, that gets the job done just fine as well.