Link rot, what a pain. Here's something very quickly thrown together based upon your expanded criteria.
- Create a new directory and copy your htm files into it.
- Install the required modules by running the following from the command prompt: cpanm Mojolicious Path::Tiny
- Download the code below to the same location. Run the code.
This code reads the content of each htm file in the current directory, parses it with Mojo::DOM, finds all links, and checks each URL with Mojo::UserAgent. If a link looks dead, it removes the link's parent HTML element. The file is saved afterwards.
Example HTML:
<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archnive.org</a></li>
<li><a href="http://sitedoesnotexist9999.net">fakesite</a></li>
</ul>
</body>
</html>
Perl code:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use Mojo::DOM;
use Mojo::UserAgent;

# get current directory
my $dir = Path::Tiny->cwd;
# one user agent is enough for the whole run
my $ua = Mojo::UserAgent->new->max_redirects(5);

# for each htm file
for my $file ( $dir->children( qr/\.htm$/ ) ) {
    # read the contents into a variable
    my $html = $file->slurp;
    # get the dom
    my $dom = Mojo::DOM->new( $html );
    # find all links that actually have an href
    for my $link ( $dom->find('a[href]')->each ) {
        # get target href
        my $url = $link->attr('href');
        say "checking link $url";
        # use Mojo::UserAgent to check if link is alive;
        # result dies on connection errors, leaving $res undefined
        my $res = eval { $ua->head( $url )->result };
        # if an error is thrown
        if ( !$res ) {
            warn "$url seems dead, removing parent";
            $link->parent->remove;
        }
        # play nice
        sleep(10);
    }
    # save file
    $file->spew( $dom->content );
}
Example HTML after running program:
<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archnive.org</a></li>
</ul>
</body>
</html>
Since I don't have an example of what you're actually using, and placeholders like [404 Not Found] don't often make sense to keep around, I removed the dead entries entirely. However, calling the replace method on the link itself, rather than remove on its parent, does exactly what you want:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use Mojo::DOM;
use Mojo::UserAgent;

# get current directory
my $dir = Path::Tiny->cwd;
# one user agent is enough for the whole run
my $ua = Mojo::UserAgent->new->max_redirects(5);

# for each htm file
for my $file ( $dir->children( qr/\.htm$/ ) ) {
    # read the contents into a variable
    my $html = $file->slurp;
    # get the dom
    my $dom = Mojo::DOM->new( $html );
    # find all links that actually have an href
    for my $link ( $dom->find('a[href]')->each ) {
        # get target href
        my $url = $link->attr('href');
        say "checking link $url";
        # use Mojo::UserAgent to check if link is alive;
        # result dies on connection errors, leaving $res undefined
        my $res = eval { $ua->head( $url )->result };
        # if an error is thrown
        if ( !$res ) {
            warn "$url seems dead, updating link";
            $link->replace('[404 Not Found]');
        }
        # play nice
        sleep(10);
    }
    # save file
    $file->spew( $dom->content );
}
Which outputs:
<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archnive.org</a></li>
<li>[404 Not Found]</li>
</ul>
</body>
</html>
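One caveat: the check above only counts a link as dead when the request itself throws (DNS failure, refused connection, timeout). A server that answers with a 404 or 500 page does not throw, so that link would be kept. If you also want those treated as dead, something along these lines might do. This is a rough sketch, not tested against your files; link_is_dead is a name I made up, and it only assumes that $ua behaves like a Mojo::UserAgent, whose response object offers is_error for 4xx and 5xx codes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns 1 when a
# link should be treated as dead, 0 otherwise. $ua is expected to be a
# Mojo::UserAgent, or anything with a compatible head() method.
sub link_is_dead {
    my ( $ua, $url ) = @_;

    # anchors without an href, and non-HTTP schemes such as mailto:,
    # are not checked here and are therefore never reported as dead
    return 0 unless defined $url && $url =~ m{^https?://}i;

    # result() dies on connection errors (DNS failure, timeout, refused)
    my $res = eval { $ua->head($url)->result };
    return 1 unless $res;

    # is_error() is true for 4xx and 5xx status codes
    return $res->is_error ? 1 : 0;
}
```

In the script you would then write if ( link_is_dead( $ua, $url ) ) { ... } in place of the eval and the if that follows it. Be aware that some servers answer HEAD requests with an error even though the page is fine, so it may be worth retrying with get before deleting anything.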
There's a 10 second sleep in there so you don't batter the sites being checked. There is room for optimisation, for example skipping a URL that occurs more than once in a file, or keeping a list of URLs already tested as working, but I'll leave that as an exercise for you.
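To give you a head start on that exercise, here is one way the "already tested" list could look. It is only a sketch: make_cached_checker is a made-up name, and the check passed in below is a stub that counts its calls rather than a real network request.

```perl
use strict;
use warnings;

# Hypothetical wrapper: takes a code ref that checks a single URL and
# returns a code ref that remembers each verdict, so every distinct URL
# is only tested once per run.
sub make_cached_checker {
    my ($check) = @_;
    my %verdict;    # url => 1 (dead) or 0 (alive)
    return sub {
        my ($url) = @_;
        $verdict{$url} = $check->($url) ? 1 : 0
            unless exists $verdict{$url};
        return $verdict{$url};
    };
}

# Demo with a stub check that counts how often it is called; in the
# real script the code ref would wrap the Mojo::UserAgent request.
my $calls   = 0;
my $is_dead = make_cached_checker( sub { $calls++; $_[0] =~ /doesnotexist/ } );

$is_dead->($_) for qw(
    http://perlmonks.org
    http://perlmonks.org
    http://sitedoesnotexist9999.net
);
print "network checks made: $calls\n";
```

If you create the checker once, before the loop over files, the cache is shared across all of them. Moving the sleep inside the wrapped code ref means repeated URLs cost no waiting time at all.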
Update: small addition.