Before I heard about WWW::Mechanize, LWP was my favorite module set. I did lots of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.). Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier to use. I use it mainly for testing web applications together with Test::More and Test::DatabaseRow, and it works great.
In my LWP days, I always wished for a way to describe a scraping session in a file and run a general Perl script to execute that description, rather than writing new code for each case. I never pursued that. Recently I started thinking about it again, now armed with WWW::Mechanize.
What I'm trying to do is describe a sequence of scraping steps as, for example:

    <mechanize>
      <get url="http://perlmonks.org/index.pl?node=login" output="login.html"/>
      <submit form_name="login" user="" passwd="" button="login" output="index.html"/>
      <get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output="newest.html"/>
    </mechanize>

Then a driver program parses this and takes the appropriate actions. The advantage is at least to avoid coding, and also to allow a non-Perl programmer or a non-programmer to do scraping. The following is a very preliminary start (e.g., many commands are hardcoded). I'm posting it here to first see whether something like this already exists, and to seek your advice and comments. For example, XML doesn't seem to be the right language here since scraping is not usually hierarchical; I'm using XML just to avoid doing my own parsing.
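On the question of whether XML is the right language: a flat, line-oriented format is one possible alternative. The sketch below is only an illustration under stated assumptions. The `command key=value` syntax and the `parse_script` helper are hypothetical and are not part of the code in this post; the commands and attribute names are taken from the XML sample.

```perl
use strict;
use warnings;

# Hypothetical line format (an assumption, not from the original post):
#   command key=value key="value with spaces" ...
# Blank lines and lines starting with '#' are ignored.
sub parse_script {
    my ($text) = @_;
    my @steps;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($cmd, $rest) = $line =~ /^\s*(\S+)\s*(.*)$/;
        my %params;
        # Accept key="quoted value" or key=bare_value pairs.
        while ($rest =~ /(\w+)=(?:"([^"]*)"|(\S+))/g) {
            $params{$1} = defined $2 ? $2 : $3;
        }
        push @steps, { cmd => $cmd, params => \%params };
    }
    return \@steps;
}

my $steps = parse_script(<<'END');
# log in, then fetch the newest nodes
get url=http://perlmonks.org/index.pl?node=login output=login.html
submit form_name=login user="" passwd="" button=login output=index.html
get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output=newest.html
END

# Show each command with the parameter names that were parsed for it.
for my $step (@$steps) {
    printf "%s: %s\n", $step->{cmd}, join(', ', sort keys %{ $step->{params} });
}
```

Each step comes back as a command name plus a parameter hash, which is the same shape the SAX handler builds from element attributes, so the same command-handling code could sit behind either front end.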
My simple driver program is as follows:

    use strict;
    use warnings;
    use WWW::Mechanize;
    use XML::SAX;
    use CmdHandler;

    # stop here rather than letting the parser fail on an undefined file name
    die "Need to pass an input file name\n" unless @ARGV;

    my $agent  = WWW::Mechanize->new;
    my $parser = XML::SAX::ParserFactory->parser( Handler => CmdHandler->new($agent) );
    $parser->parse_uri($ARGV[0]);
    exit(0);

Where the CmdHandler.pm is as follows:

    package CmdHandler;

    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    # attributes that steer the driver and are not form fields
    my %CONTROL = map { $_ => 1 } qw(form_name button output);

    sub new {
        my $class = shift;
        my $self  = $class->SUPER::new();
        $self->_init(@_);
        return $self;
    }

    sub _init {
        my ($self, $agent) = @_;
        $self->{agent} = $agent;
    }

    sub start_element {
        my ($self, $el) = @_;
        my $name = $el->{Name};
        print "Processing start_element:$name\n";
        return if $name eq "mechanize";

        # put all attributes in a hash, is there a better way?
        my %params = ();
        foreach my $k (values %{ $el->{Attributes} }) {
            $params{ $k->{Name} } = $k->{Value};
        }

        # well, some ugly hardcoded if-else, a better way?
        if ($name eq "get") {
            $self->{agent}->get($params{url});
        } elsif ($name eq 'submit') {
            # submit() takes no arguments; submit_form() selects the form,
            # fills in the fields, and clicks the button in one call
            my %fields = map { $_ => $params{$_} } grep { !$CONTROL{$_} } keys %params;
            $self->{agent}->submit_form(form_name => $params{form_name},
                                        button    => $params{button},
                                        fields    => \%fields);
        } elsif ($name eq 'back') {
            $self->{agent}->back();
        } elsif ($name eq 'follow_link') {
            # pass only the criteria that were given; url_regex must be a qr// object
            my %crit = map { $_ => $params{$_} } grep { defined $params{$_} } qw(n text);
            $crit{url_regex} = qr/$params{url_regex}/ if defined $params{url_regex};
            $self->{agent}->follow_link(%crit);
        } else {
            print "Hey, don't know what you mean, maybe in the next version.\n";
        }

        # maybe we want to print out to a file?
        my $file = $params{output};
        if ($file) {
            if ($file eq "stdout") {
                print $self->{agent}->content();
            } elsif ($file eq "none") {
                # explicitly discard the page
            } elsif (open(my $out, '>', $file)) {
                print {$out} $self->{agent}->content();
                close($out);
            } else {
                warn "Can't open $file for writing: $!\n";
            }
        }
        return $self->SUPER::start_element($el);
    }

    1;
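As for the two "is there a better way?" comments in the handler, one common Perl idiom is a dispatch table: a hash that maps each element name to a code reference. The following is a minimal sketch of that idea, not a drop-in replacement. The names `%DISPATCH`, `%CONTROL`, `attrs_to_hash`, `run_command`, and the `RecordingAgent` stand-in class are all hypothetical; the stand-in only records calls so the example runs without network access, and a real WWW::Mechanize object would be passed in its place.

```perl
use strict;
use warnings;

# Attribute names that steer the driver and are not form fields
# (an assumption based on the sample input in this post).
my %CONTROL = map { $_ => 1 } qw(form_name button output);

# One entry per element name; each handler receives the agent and the
# flattened attribute hash.
my %DISPATCH = (
    get    => sub { my ($agent, $p) = @_; $agent->get($p->{url}) },
    back   => sub { my ($agent)     = @_; $agent->back },
    submit => sub {
        my ($agent, $p) = @_;
        my %fields = map { $_ => $p->{$_} } grep { !$CONTROL{$_} } keys %$p;
        my %args   = (form_name => $p->{form_name}, fields => \%fields);
        $args{button} = $p->{button} if defined $p->{button};
        $agent->submit_form(%args);
    },
);

# Flatten the SAX attribute structure into name => value pairs.
sub attrs_to_hash {
    my ($el) = @_;
    my %attr = map { $_->{Name} => $_->{Value} } values %{ $el->{Attributes} };
    return \%attr;
}

# Returns 1 if the element was handled, 0 if its name is unknown.
sub run_command {
    my ($agent, $el) = @_;
    my $handler = $DISPATCH{ $el->{Name} } or return 0;
    $handler->($agent, attrs_to_hash($el));
    return 1;
}

# Hypothetical stand-in for WWW::Mechanize that only records what was called.
package RecordingAgent {
    sub new  { return bless { calls => [] }, shift }
    sub get  { my $self = shift; push @{ $self->{calls} }, "get $_[0]" }
    sub back { my $self = shift; push @{ $self->{calls} }, 'back' }
    sub submit_form {
        my ($self, %args) = @_;
        push @{ $self->{calls} },
            "submit_form form=$args{form_name} fields="
            . join(',', sort keys %{ $args{fields} });
    }
}

# An element shaped the way a SAX parser hands it to start_element.
my $agent = RecordingAgent->new;
my $el    = {
    Name       => 'submit',
    Attributes => {
        '{}form_name' => { Name => 'form_name', Value => 'login' },
        '{}user'      => { Name => 'user',      Value => 'me' },
        '{}output'    => { Name => 'output',    Value => 'index.html' },
    },
};
run_command($agent, $el) or warn "Unknown command: $el->{Name}\n";
print "$_\n" for @{ $agent->{calls} };
```

With this layout, supporting a new command means adding one entry to the table, and the unknown-command case becomes a single failed hash lookup instead of a trailing `else` branch.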