Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Change Absolute to Relative links in HTML files

by dkubb (Deacon)
on Feb 05, 2001 at 02:22 UTC ( [id://56338]=sourcecode: print w/replies, xml ) Need Help??
Category: HTML Utility
Author/Contact Info Dan Kubb (dkubb) dkubb@cpan.org
Description:

This utility will recurse through a specified directory, parse all the .htm and .html files, and replace any absolute URL's with relative URL's to a base you define.

You can also specify what types of links to parse: img, src, action, or any others. Please see HTML::Tagset's %linkElements hash, in the module's source, for a precise breakdown of supported tag-types.

This program was good practice for trying out Getopt::Declare, an excellent command-line parser. Please note the parameter specification below the __DATA__ tag.

Disclaimer: Always use the -b switch to force backups, just in case you have non-standard HTML and the HTML::TreeBuilder parser mangles it.

Comments and suggestions for improvement are always welcome and very much appreciated.

#!/usr/bin/perl -w

use strict;
use Getopt::Declare;
use Cwd;
use File::Find;
use HTML::TreeBuilder;
use URI::URL qw(url);
use File::Copy;
use IO::File;

use vars qw($VERSION $BASE_URL $BACKUP $DIRECTORY @WANTED_TYPES);

$VERSION = (qw$Revision: 1.1 $)[1];

#Load the definition and grab the command-line parameters
my $opts = Getopt::Declare->new( do{ local $/; <DATA> } );

#Cycle through this directory, and all those underneath it
find(\&wanted, $DIRECTORY || getcwd);

#Parse each HTML file and make a backup
#of it using File::Copy::copy.
sub wanted {
  return unless $File::Find::name =~ /html?$/;

  #Extract Links from the file
  my $h = HTML::TreeBuilder->new;
  $h->parse_file($File::Find::name);

  my $link_elements = $h->extract_links(@WANTED_TYPES);
  return unless @$link_elements;

  #Grab each img src and re-write them so they are relative URL's
  foreach my $link_element (@$link_elements) {
    my $link    = shift @$link_element; #URL value
    my $element = shift @$link_element; #HTML::Element Object

    my $url = url($link)->canonical;
    next unless $url->can('host_port') and
      $BASE_URL->host_port eq $url->host_port;

    #Sigh.. The following is because extract_links() doesn't
    #tell you which attribute $link belongs to, except to say
    #it is the value of an attribute in $element.. somewhere.

    #Given the $link, find out which attribute it was for
    my ($attr) = grep {
      defined $element->attr($_) and $link eq $element->attr($_)
    } @{ $HTML::Tagset::linkElements{$element->tag} };

    #Re-write the attribute in the HTML::Element Tree
    #Note: $BASE_URL needs to be quoted here.
    $element->attr($attr, $url->path("$BASE_URL"));
  }

  #Make a backup of the file before over-writing it
  copy $File::Find::name => $File::Find::name.'.bak'
    if defined $BACKUP;

  #Save the updated file
  my $fh = IO::File->new($File::Find::name, O_RDWR)
    or die "Could not open $File::Find::name: $!";
  $fh->print($h->as_HTML);
}

__DATA__
#If there is an error here, you need to have one tab
#between the <$var> and the option description.
 -u <base_url>         Base URL (http://www.yoursite.com) [required]
                      { $BASE_URL = url($base_url)->canonical }
 -b                    Backup changed files
                      { $BACKUP = 1  }
 -d <directory>        Starting Directory to recurse from
                      { $DIRECTORY = $directory }
 -l <links>...         Links to process: img, href, etc [required]
                      { @WANTED_TYPES = @links }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: sourcecode [id://56338]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2024-04-24 19:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found