Trivial HTML extractor utility

I have a really useful trivial utility, called linkx, that is basically just a command-line wrapper around HTML::LinkExtor: you give it the names of one or more HTML files (or nothing, to read standard input) and it extracts the URLs in them and prints each one once. I use this all the time:

#!/usr/bin/perl

use HTML::LinkExtor;
use Getopt::Std ;
getopts('b:t:');

@ARGV = '-' unless @ARGV;
for my $file (@ARGV) {
  extract($file);
}

sub extract {
  my $file = shift;
  unless (open F, "< $file") {
    warn "Couldn't open file $file: $!; skipping\n";
    return;
  }
  my $p = HTML::LinkExtor->new(undef, $opt_b);
  while (read F, my $buf, 8192) {
    $p->parse($buf);
  }
  for my $ln ($p->links) {
    my @ln = @$ln;
    my $tag = shift @ln;
    next if $opt_t && lc($opt_t) ne lc($tag);
    while (@ln) {
      shift @ln;
      my $url = shift @ln;
      print $url, "\n" unless $seen{$url}++;
    }
  }
}

You can tell this is really old because it uses two-argument open.

The -b base flag interprets all URLs relative to the base URL base and prints out the absolute versions. The -t tag flag restricts the program to printing only URLs that appear in that kind of element (a or img, say), instead of all links. I had totally forgotten that the -t feature was in there. I wonder if it's useful?
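
As a rough illustration of what the two flags do, here is a minimal sketch that makes the same HTML::LinkExtor calls on an in-memory string. The sample HTML, the base URL, and the helper name urls_for are all made up for the example; none of them is part of linkx.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample input and base URL, for illustration only.
my $html = '<a href="/about">About</a><img src="logo.png">'
         . '<a href="http://example.org/x">X</a>';
my $base = 'http://example.com/dir/index.html';

# urls_for is a made-up helper that mirrors what linkx does for one document.
# $base plays the role of -b and $want_tag the role of -t; either may be undef.
sub urls_for {
    my ($html, $base, $want_tag) = @_;
    my $p = HTML::LinkExtor->new(undef, $base);
    $p->parse($html);
    $p->eof;
    my @urls;
    for my $link ($p->links) {
        # Each link is [ tag, attr => url, attr => url, ... ]
        my ($tag, @pairs) = @$link;
        next if defined $want_tag && lc $want_tag ne lc $tag;
        while (my ($attr, $url) = splice @pairs, 0, 2) {
            push @urls, "$url";    # stringify: with a base, URLs are URI objects
        }
    }
    return @urls;
}

print "$_\n" for urls_for($html);                # URLs as written in the page
print "$_\n" for urls_for($html, $base);         # like -b: absolute URLs
print "$_\n" for urls_for($html, $base, 'img');  # like -b plus -t img
```

With this sample, the second call should print http://example.com/about, http://example.com/dir/logo.png, and http://example.org/x, and the third only the logo.png URL.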

Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:

  for i in `cat reddit`; do
    GET $i | grep -i title
  done

hoping that the title element would be alone on a line. (GET is a utility that comes with Perl's LWP suite; you give it a URL and it fetches the document and prints it.)

This was a complete failure. Not only were the title elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.
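
The failure is easy to reproduce without fetching anything. The sketch below uses a made-up page with no line breaks and applies the same line-oriented filter that grep -i title would; it is only meant to show why the match comes back as one enormous line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page: valid HTML, but with no line breaks anywhere.
my $page = '<html><head><title>Some post</title></head><body>'
         . ('<p>A comment</p>' x 3)
         . '</body></html>';

# What grep -i title does: keep every line that mentions "title".
my @matches = grep { /title/i } split /\n/, $page;

printf "%d matching line(s); the first is %d characters long\n",
    scalar @matches, length $matches[0];
print "The match is the whole document\n" if $matches[0] eq $page;
```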

So I wrote the following item, htmlx, which solved the problem:

#!/usr/bin/perl

use HTML::TreeBuilder;

my @tags = @ARGV;

my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file(\*STDIN);
my @elements = $tree->find(@tags);

for (@elements) {
  my $s = $_->as_text;
  $s =~ tr/\n/ /;
  print "$s\n";
}

You give this one or more tag names, and it reads HTML from standard input and prints the text content of every element with one of those tags, one element per line. My Reddit searcher that didn't work became:

  for i in `cat reddit`; do
    GET $i | htmlx title
  done

which did work. Hooray!
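
It works because the parser finds elements by structure and ignores how the source is broken into lines. Here is a minimal sketch of the same steps on an in-memory string, assuming HTML::TreeBuilder's new_from_content constructor in place of reading standard input. The sample HTML and the helper name element_text are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: no line breaks between elements, one inside the title.
my $html = "<html><head><title>Some\npost</title></head>"
         . "<body><h1>Heading</h1><p>Text</p></body></html>";

# element_text is a made-up helper that follows the same steps as htmlx,
# but returns the strings instead of printing them.
sub element_text {
    my ($html, @tags) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @text;
    for my $el ($tree->find(@tags)) {
        my $s = $el->as_text;
        $s =~ tr/\n/ /;      # keep each element on one output line
        push @text, $s;
    }
    $tree->delete;           # release the tree explicitly
    return @text;
}

print "$_\n" for element_text($html, 'title');        # one tag name
print "$_\n" for element_text($html, 'title', 'h1');  # several tag names
```

With this sample, the first call should print "Some post" and the second "Some post" followed by "Heading".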

I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a GET url | htmlx title pipeline, I'll probably demote it from htmlx to just TITLE, but it's too soon to tell if that's a good idea.

I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!

--
Mark Dominus
Perl Paraphernalia
