Before I heard about WWW::Mechanize, LWP was my favorite module set. I did a lot of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.). Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier to use. I use it mainly for testing web applications with Test::More and Test::DatabaseRow, and it works great.
In my LWP days, I always wished for a way to describe a scraping session in a file and run a general Perl script to execute that description, rather than writing code for each case.
I never pursued that. Recently I started thinking about it again, now armed with WWW::Mechanize.
What I'm trying to do is describe a sequence of scraping steps as, for example:
<mechanize>
  <get url="http://perlmonks.org/index.pl?node=login" output="login.html"/>
  <submit form_name="login" user="" passwd="" button="login" output="index.html"/>
  <get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output="newest.html"/>
</mechanize>
A driver program then parses this file and takes the appropriate actions. The advantage is that no per-site code has to be written, and that someone who doesn't know Perl, or doesn't program at all, can still do scraping. The following is a very preliminary start (e.g., many commands are hardcoded). I'm posting it here first to see whether something like this already exists, and to seek your advice and comments. For example, XML doesn't seem to be the right language here, since a scraping session is usually a flat sequence of steps rather than a hierarchy; I'm using XML just to avoid writing my own parser.
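As a rough illustration of what a non-XML description could look like, here is a minimal sketch of a line-oriented format: one command per line, followed by key=value pairs. The format itself and the function name parse_command_line are assumptions made up for this example; only Text::ParseWords (a core module) is real.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Hypothetical line format:  command key="value" key="value" ...
# Returns the command name and a hashref of its parameters,
# or an empty list for blank lines and "#" comments.
sub parse_command_line {
    my ($line) = @_;
    return if $line =~ /^\s*(?:#|$)/;

    # shellwords() splits on whitespace and honours quotes
    my ($name, @pairs) = shellwords($line);
    return unless defined $name;

    my %params;
    foreach my $pair (@pairs) {
        # split on the first "=" only, so URLs with query strings survive
        my ($key, $value) = split /=/, $pair, 2;
        $params{$key} = defined $value ? $value : '';
    }
    return ($name, \%params);
}
```

A line such as get url="http://perlmonks.org/index.pl?node=login" output="login.html" would then yield the name get and the same parameters the XML attributes carry, so the rest of the driver would not need to change.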
My simple driver program is as follows:
use strict;
use warnings;
use WWW::Mechanize;
use XML::SAX;
use CmdHandler;

# the command file is required, so stop here if it is missing
die "Need to pass an input file name\n" unless @ARGV;

my $agent  = WWW::Mechanize->new;
my $parser = XML::SAX::ParserFactory->parser(
    Handler => CmdHandler->new($agent)
);
$parser->parse_uri($ARGV[0]);
exit(0);
CmdHandler.pm is as follows:
package CmdHandler;
use strict;
use warnings;
use base qw(XML::SAX::Base);

sub new {
    my $class = shift;
    my $self  = $class->SUPER::new();
    $self->_init(@_);
    return $self;
}

sub _init {
    my ($self, $agent) = @_;
    $self->{agent} = $agent;
}

sub start_element {
    my ($self, $el) = @_;
    my $name = $el->{Name};
    print "Processing start_element:$name\n";
    return if $name eq "mechanize";

    # put all attributes in a hash, is there a better way?
    my %params = ();
    foreach my $k (values %{ $el->{Attributes} }) {
        $params{ $k->{Name} } = $k->{Value};
    }

    # well, some ugly hardcoded if-else, a better way?
    my $agent = $self->{agent};
    if ($name eq "get") {
        $agent->get($params{url});
    } elsif ($name eq 'submit') {
        # submit() takes no arguments, so use submit_form(); every attribute
        # other than form_name, button and output is treated as a form field
        my %fields = %params;
        delete @fields{qw(form_name button output)};
        my %args = (fields => \%fields);
        foreach my $key (qw(form_name button)) {
            $args{$key} = $params{$key} if defined $params{$key};
        }
        $agent->submit_form(%args);
    } elsif ($name eq 'back') {
        $agent->back();
    } elsif ($name eq 'follow_link') {
        # pass only the criteria that were given; url_regex must be a qr// object
        my %criteria = map { $_ => $params{$_} }
                       grep { defined $params{$_} } qw(n text url_regex);
        $criteria{url_regex} = qr/$criteria{url_regex}/
            if defined $criteria{url_regex};
        $agent->follow_link(%criteria);
    } else {
        print "Hey, don't know what you mean, may be in next version.\n";
    }

    # may be we want to print out to a file?
    my $file = $params{output};
    if ($file) {
        if ($file eq "stdout") {
            print $agent->content();
        } elsif ($file eq "none") {
            # explicitly discard the page
        } elsif (open(my $out, '>', $file)) {
            print {$out} $agent->content();
            close($out);
        } else {
            warn "Can't open $file for writing: $!\n";
        }
    }
    return $self->SUPER::start_element($el);
}
1;
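On the "ugly hardcoded if-else" question in the comments above, one common alternative is a dispatch table: a hash that maps each command name to a code reference. The following is only a sketch under assumptions. StubAgent is a hypothetical stand-in that records calls so the example runs without network access, and %dispatch and run_command are names invented here; in the real handler the agent would be the WWW::Mechanize object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Mechanize: it only records each call.
{
    package StubAgent;
    sub new         { return bless { calls => [] }, shift }
    sub get         { my $s = shift; push @{ $s->{calls} }, [ get         => @_ ]; return }
    sub back        { my $s = shift; push @{ $s->{calls} }, [ back        => @_ ]; return }
    sub submit_form { my $s = shift; push @{ $s->{calls} }, [ submit_form => @_ ]; return }
}

# One entry per command; adding a command means adding one entry.
my %dispatch = (
    get    => sub { my ($agent, $p) = @_; $agent->get($p->{url}) },
    back   => sub { my ($agent)     = @_; $agent->back() },
    submit => sub {
        my ($agent, $p) = @_;
        # everything that is not a control attribute is a form field
        my %fields = %$p;
        delete @fields{qw(form_name button output)};
        $agent->submit_form(form_name => $p->{form_name},
                            button    => $p->{button},
                            fields    => \%fields);
    },
);

# Returns 1 if the command was known and executed, 0 otherwise.
sub run_command {
    my ($agent, $name, $params) = @_;
    my $handler = $dispatch{$name} or return 0;
    $handler->($agent, $params);
    return 1;
}
```

Inside start_element, the whole if-else chain would then shrink to a single call such as run_command($self->{agent}, $name, \%params), with the "don't know what you mean" message printed when it returns 0.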