http://qs321.pair.com?node_id=652311

I have a really useful trivial utility, called linkx, that is basically just a command-line wrapper around HTML::LinkExtor: you give it the name of an HTML file and it extracts and prints all the URLs in the file. I use this all the time:

#!/usr/bin/perl use HTML::LinkExtor; use Getopt::Std ; getopts('b:t:'); @ARGV = '-' unless @ARGV; for my $file (@ARGV) { extract($file); } sub extract { my $file = shift; unless (open F, "< $file") { warn "Couldn't open file $file: $!; skipping\n"; return; } my $p = HTML::LinkExtor->new(undef, $opt_b); while (read F, my $buf, 8192) { $p->parse($buf); } for my $ln ($p->links) { my @ln = @$ln; my $tag = shift @ln; next if $opt_t && lc($opt_t) ne lc($tag); while (@ln) { shift @ln; my $url = shift @ln; print $url, "\n" unless $seen{$url}++; } } }
You can tell this is really old because it uses two-argument open.

The -b base flag interprets all URLs relative to base base and prints out the absolute versions. The -t tag flag restricts the program to only printing out URLs that appear in that kind of entity, instead of all links. I had totally forgotten that the -t feature was in there. I wonder if it's useful?

Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:

for i in `cat reddit`; do GET $i | grep -i title done
hoping that the title element would be alone on a line. (GET is a utility that comes with Perl's LWP suite; you give it a URL and it fetches the document and prints it.)

This was a complete failure. Not only were the title elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.

So I wrote the following item, htmlx, which solved the problem:

#!/usr/bin/perl use HTML::TreeBuilder; my @tags = @ARGV; my $tree = HTML::TreeBuilder->new; # empty tree $tree->parse_file(\*STDIN); my @elements = $tree->find(@tags); for (@elements) { my $s = $_->as_text; $s =~ tr/\n/ /; print "$s\n"; }
You give this a tag name, and then it reads HTML from standard input and prints the contents of all the entities with this tag. My Reddit searcher that didn't work became:

for i in `cat reddit`; do GET $i | htmlx title done
which did work. Hooray!

I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a GET url | htmlx title pipeline, I'll probably demote it from htmlx to just TITLE, but it's too soon to tell if that's a good idea.

I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!