Before I heard about WWW::Mechanize, LWP was my favorite module set. I did lots of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.) Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier. I use it mainly for testing web applications with Test::More and Test::DatabaseRow, it works great.

In my LWP days, I always wished to have a way to describe a scraping in a file, and run a general perl script to execute that description, rather than coding for each case. I never did pursue that. Recently I started thinking about it again, now armed with WWW::Mechanize.

What I'm trying to do is to be able to describe a sequence of scraping as, for example:

<mechanize> <get url="" output="login. +html"/> <submit form_name="login" user="" passwd="" button="login" output= +"index.html"/> <get url=" Nodes" out +put="newest.html"/> </mechanize>
Then have a driver program to parse this and take the appropriate actions. The advantage is at least to avoid coding, and also to allow a non-perl or non-programmer to do scraping. The following is a very preliminary start (e.g., many commands hardcoded), the purpose to put it here is to first see whether something like this already exists, and to seek your advice/comments. For example, XML doesn't seem to be the right language here since scraping is not usually hierarchical, I'm using xml just to avoid doing my own parsing.

My simple driver program is as follows:

use strict; use WWW::Mechanize; use XML::SAX; use CmdHandler; if(!@ARGV){ print "Need to pass a input file name"; } my $agent = WWW::Mechanize->new; my $parser = XML::SAX::ParserFactory->parser( Handler => CmdHandler->ne +w($agent) ); $parser->parse_uri($ARGV[0]); exit(0);
Where the is as follows:
package CmdHandler; use strict; use base qw(XML::SAX::Base); sub new{ my $class = shift; my $self = $class->SUPER::new(); $self->_init(@_); return $self; } sub _init{ my ($self,$agent) = @_; $self->{agent} = $agent; } sub start_element{ my ($self,$el) = @_; my $name = $el->{Name}; print "Processing start_element:$name\n"; return if $name eq "mechanize"; # put all attributes in a hash, is there a better way? my %params = (); foreach my $k (values %{$el->{Attributes}}){ $params{$k->{Name}}=$k->{Value}; } # well, some ugly hardcoded if-else, a better way? if($name eq "get"){ $self->{agent}->get($params{url}); }elsif($name eq 'submit'){ $self->{agent}->submit(form_name=>$params{form_name}, button=>$params{button}, fields=>\%params); }elsif($name eq 'back'){ $self->{agent}->back(); }elsif($name eq 'follow_link'){ $self->{agent}->follow_link(n => $params{n}, text=>$params{text}, url_regex=>$params{url_regex}); }else{ print "Hey, don't know what you mean, may be in next version.\ +n"; } # may be we want to print out to a file? my $file = $params{output}; if($file){ if($file eq "stdout"){ print $self->{agent}->content(); }elsif($file eq "none"){ }else{ open(OUTPUT, ">$file") or warn "Can't open $file for writi +ng\n"; print OUTPUT $self->{agent}->content(); close(OUTPUT); } } return $self->SUPER::start_element($el); } 1;