PerlMonks  

Before I heard about WWW::Mechanize, LWP was my favorite module set. I did lots of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.). Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier to use. I use it mainly for testing web applications with Test::More and Test::DatabaseRow, and it works great.

In my LWP days, I always wished for a way to describe a scraping session in a file and have a general Perl script execute that description, rather than writing new code for each case. I never pursued it. Recently I started thinking about it again, now armed with WWW::Mechanize.

What I'd like is to be able to describe a scraping sequence like this, for example:

<mechanize>
  <get url="http://perlmonks.org/index.pl?node=login" output="login.html"/>
  <submit form_name="login" user="" passwd="" button="login" output="index.html"/>
  <get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output="newest.html"/>
</mechanize>

Then a driver program parses this and takes the appropriate actions. The advantage is, at the least, to avoid coding, and also to let a non-Perl programmer or a non-programmer do scraping. The following is a very preliminary start (e.g., many commands are hardcoded). I'm posting it here first to see whether something like this already exists, and to seek your advice and comments. For example, XML doesn't seem to be the right language here, since scraping is not usually hierarchical; I'm using XML just to avoid doing my own parsing.
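
To give a feel for how little parsing a flat, non-XML format would need, here is a rough sketch. It assumes a made-up line format (one command per line, followed by key=value pairs, with # starting a comment); parse_command_line is a hypothetical helper name, and the only module used is the core Text::ParseWords, so treat it as an illustration rather than a worked-out design.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Parse one line of the hypothetical flat format, e.g.
#   get url="http://example.com/" output=page.html
# Returns (command, hashref of params), or an empty list for blank lines,
# comment lines, and lines that shellwords cannot split (unbalanced quotes).
sub parse_command_line {
    my ($line) = @_;
    return if $line =~ /^\s*(?:#.*)?$/;

    # shellwords splits on whitespace and honours quotes inside a word.
    my ($cmd, @pairs) = shellwords($line);
    return unless defined $cmd;

    my %params;
    for my $pair (@pairs) {
        # Split on the first '=' only, so URLs with query strings survive.
        my ($key, $value) = split /=/, $pair, 2;
        $params{$key} = defined $value ? $value : '';
    }
    return ($cmd, \%params);
}

# The same sequence as the XML example, in the assumed flat format.
my @script = (
    '# log in, then fetch the newest nodes',
    'get url=http://perlmonks.org/index.pl?node=login output=login.html',
    'submit form_name=login user="" passwd="" button=login output=index.html',
    'get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output=newest.html',
);

for my $line (@script) {
    my ($cmd, $params) = parse_command_line($line) or next;
    print "$cmd: ",
        join(', ', map { "$_=$params->{$_}" } sort keys %$params), "\n";
}
```

The trade-off is that a format like this gives up XML's well-formedness checking in exchange for files that are easier to write by hand.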

My simple driver program is as follows:

use strict;
use warnings;
use WWW::Mechanize;
use XML::SAX;
use CmdHandler;

# Stop early if no command file was given.
die "Need to pass an input file name\n" unless @ARGV;

my $agent  = WWW::Mechanize->new;
my $parser = XML::SAX::ParserFactory->parser(
    Handler => CmdHandler->new($agent),
);
$parser->parse_uri($ARGV[0]);
exit(0);

CmdHandler.pm is as follows:

package CmdHandler;
use strict;
use warnings;
use base qw(XML::SAX::Base);

sub new {
    my $class = shift;
    my $self  = $class->SUPER::new();
    $self->_init(@_);
    return $self;
}

sub _init {
    my ($self, $agent) = @_;
    $self->{agent} = $agent;
}

sub start_element {
    my ($self, $el) = @_;
    my $name = $el->{Name};
    print "Processing start_element:$name\n";
    return if $name eq "mechanize";

    # put all attributes in a hash, is there a better way?
    my %params = ();
    foreach my $k (values %{ $el->{Attributes} }) {
        $params{ $k->{Name} } = $k->{Value};
    }

    # well, some ugly hardcoded if-else, a better way?
    if ($name eq "get") {
        $self->{agent}->get($params{url});
    } elsif ($name eq 'submit') {
        # submit() takes no arguments; submit_form() selects the form and
        # fills it in. Pass only real form fields, not control attributes.
        my %fields = %params;
        delete @fields{qw(form_name button output)};
        $self->{agent}->submit_form(
            form_name => $params{form_name},
            button    => $params{button},
            fields    => \%fields,
        );
    } elsif ($name eq 'back') {
        $self->{agent}->back();
    } elsif ($name eq 'follow_link') {
        # Pass only the criteria that were given; url_regex must be a regex.
        my %link = map { $_ => $params{$_} }
                   grep { defined $params{$_} } qw(n text);
        $link{url_regex} = qr/$params{url_regex}/
            if defined $params{url_regex};
        $self->{agent}->follow_link(%link);
    } else {
        print "Hey, don't know what you mean, maybe in the next version.\n";
    }

    # maybe we want to print out to a file?
    my $file = $params{output};
    if ($file) {
        if ($file eq "stdout") {
            print $self->{agent}->content();
        } elsif ($file ne "none") {
            # Write only if the file could actually be opened.
            if (open(my $out, '>', $file)) {
                print {$out} $self->{agent}->content();
                close($out);
            } else {
                warn "Can't open $file for writing: $!\n";
            }
        }
    }
    return $self->SUPER::start_element($el);
}

1;
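
On the "ugly hardcoded if-else, a better way?" question in the code comment, one common Perl idiom is a dispatch table: a hash mapping each element name to a code reference. The sketch below is only an illustration of that idiom, not a drop-in replacement. StubAgent is a hypothetical stand-in that records calls so the example runs without WWW::Mechanize or a network, and run_command and %dispatch are made-up names.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for WWW::Mechanize: records calls, does no I/O.
    package StubAgent;
    sub new         { return bless { calls => [] }, shift }
    sub get         { my $s = shift; push @{ $s->{calls} }, [ get => @_ ] }
    sub back        { my $s = shift; push @{ $s->{calls} }, [ back => @_ ] }
    sub follow_link { my $s = shift; push @{ $s->{calls} }, [ follow_link => @_ ] }
}

# Element name => handler. Each handler gets the agent and a hashref of
# the element's attributes.
my %dispatch = (
    get  => sub { my ($agent, $p) = @_; $agent->get($p->{url}) },
    back => sub { my ($agent) = @_; $agent->back() },
    follow_link => sub {
        my ($agent, $p) = @_;
        # Forward only the criteria that were actually supplied.
        my %args = map { $_ => $p->{$_} } grep { defined $p->{$_} } qw(n text);
        $args{url_regex} = qr/$p->{url_regex}/ if defined $p->{url_regex};
        $agent->follow_link(%args);
    },
);

# Look up and run the handler; returns 1 if the command was known, else 0.
sub run_command {
    my ($agent, $name, $params) = @_;
    my $handler = $dispatch{$name};
    unless ($handler) {
        warn "Unknown command: $name\n";
        return 0;
    }
    $handler->($agent, $params);
    return 1;
}

my $agent = StubAgent->new;
run_command($agent, 'get',  { url => 'http://perlmonks.org/' });
run_command($agent, 'back', {});
print scalar(@{ $agent->{calls} }), " calls recorded\n";
```

With this shape, supporting a new command means adding one entry to %dispatch, and start_element would shrink to building %params and calling something like run_command.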

In reply to To mechanize WWW::Mechanize: a scraping language? by johnnywang
