PerlMonks  

Web Scraping on CGI Scripts?

by fraizerangus (Sexton)
on Oct 09, 2011 at 19:41 UTC ( [id://930488]=perlquestion )

fraizerangus has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need to scrape an online database that is served by CGI scripts, and all of the scraping and browsing tools seem to work better on HTML code. Does anyone have any ideas or advice? Best wishes and many thanks, Dan

Replies are listed 'Best First'.
Re: Web Scraping on CGI Scripts?
by Corion (Patriarch) on Oct 09, 2011 at 19:49 UTC

    Do you have access to the web server machine itself?

    If you are talking about web scraping, it behooves you to understand where HTML files and CGI files differ, and also from which point on there is no difference between the two.

    Personally, I use WWW::Mechanize, or Web::Scraper, or App::scrape. Others have reported good success using Mojolicious. Which one of these have you tried and how did they fail for you?

      Hi Corion

      Unfortunately I don't have access to the server machine itself, I'm doing it over the internet entirely. I've been using Web::Scraper which works really well with properly formatted HTML but not with CGI. I'm trying to get the URL and name of entries in the table. For example the code I've been using is:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use URI;
      use Web::Scraper;

      open FILE, ">file.txt" or die $!;

      # website to scrape
      my $urlToScrape = "http://www.molmovdb.org/cgi-bin/browse.cgi";

      # prepare data
      my $teamsdata = scraper {
          # we will save the urls from the teams
          process "tr.cell2> A", 'urls[]' => '@href';
          # we will save the team names
          process "tr.cell2> A", 'teams[]' => 'TEXT';
      };

      # scrape the data
      my $res = $teamsdata->scrape(URI->new($urlToScrape));

      Please follow the link given in the code and open its page source for an example. Best wishes, Dan

        There is no conceptual difference between "CGI" and "HTML" when they are accessed over HTTP. If Web::Scraper fails, maybe you can tell us how it fails and where.
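
One way to see where a Web::Scraper run fails is to feed it a small piece of HTML directly and dump what comes back. The sketch below is only an illustration: the markup and the entry.cgi links are hypothetical stand-ins, not the real page, and it assumes the links sit inside <td> cells. Under that assumption, a child selector such as "tr.cell2 > a" matches nothing, while the descendant selector "tr.cell2 a" does. Whether that is the actual problem here can only be confirmed against the page source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# Hypothetical markup standing in for whatever the CGI script returns;
# the real page may be structured differently.
my $html = <<'HTML';
<table>
  <tr class="cell2"><td><a href="/cgi-bin/entry.cgi?id=1">Entry one</a></td></tr>
  <tr class="cell2"><td><a href="/cgi-bin/entry.cgi?id=2">Entry two</a></td></tr>
</table>
HTML

# "tr.cell2 > a" only matches links that are direct children of the row.
my $child_only = scraper {
    process 'tr.cell2 > a', 'names[]' => 'TEXT';
};

# "tr.cell2 a" matches links anywhere inside the row, including inside <td>.
my $descendant = scraper {
    process 'tr.cell2 a', 'urls[]'  => '@href';
    process 'tr.cell2 a', 'names[]' => 'TEXT';
};

# scrape() accepts an HTML string as well as a URI object
my $direct = $child_only->scrape($html);
my $nested = $descendant->scrape($html);

print "child selector matched: ",      scalar @{ $direct->{names} || [] }, "\n";
print "descendant selector matched: ", scalar @{ $nested->{names} || [] }, "\n";

# Dump the full result to see exactly what was captured
print Dumper($nested);
```

Because the scraper only ever sees the HTML text, the same selectors behave identically whether that text came from a static file or from a CGI script.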

Re: Web Scraping on CGI Scripts?
by tospo (Hermit) on Oct 10, 2011 at 08:16 UTC
    If you are scraping a web page then it will be HTML. Or are you trying to parse output from a web service that sends a response in something like XML or JSON format? There are modules to handle these scenarios but it is important first to know what you are dealing with. Can you be more precise and maybe give a URL?
      Hi Tospo, the URL is http://www.molmovdb.org/cgi-bin/browse.cgi. I'm trying to follow all the links to the database entries iteratively and output these as text files to analyse later. As you can probably see, the coding is not formatted amazingly well. Many thanks and best wishes, Dan
        That page - apart from being marked-up in a rather old-fashioned way - isn't too bad at all. If you look at the page source code, you can easily see a table structure that you can use to parse it.
        You will want to use a module like WWW::Mechanize to interact with the website. This module allows you to interact with web content like a user would in a browser. You can make your script "click" on links to get to the text files. Use the table structure of the "browse" page to iterate over all the molecules, each time following the link through to the text data files.
        Have a go with a simple example first. There are a few here. If you are getting stuck, post the script you have so far and what's happening so we can help you along. Good luck!
        Oh, and I forgot to mention: you are always parsing the HTML output that the server sends to you. It doesn't matter that a CGI script generates the page on the server; the output is just HTML (unless it's a web service that sends XML, JSON or the like). So there is nothing special about this case.
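
The approach described above (load the browse page, follow each entry link, save the result as text) could be sketched with WWW::Mechanize roughly as follows. This is a minimal sketch under assumptions, not a tested solution for this site: the sub name save_linked_pages, the output file naming, and the link pattern qr/\.cgi\?/ are all hypothetical and need adjusting to the hrefs actually seen in the page source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use File::Spec;

# Fetch $start_url, follow every link whose href matches $link_pattern,
# and save each linked page as plain text in $out_dir.
# Returns the number of pages saved.
sub save_linked_pages {
    my ( $start_url, $link_pattern, $out_dir, $delay ) = @_;

    # autocheck makes Mechanize die on any failed request
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($start_url);

    # Collect the links up front, before navigating away from the start page
    my @links = $mech->find_all_links( url_regex => $link_pattern );

    my $count = 0;
    for my $link (@links) {
        $mech->get( $link->url_abs );

        my $name = sprintf( 'entry_%04d.txt', ++$count );
        my $file = File::Spec->catfile( $out_dir, $name );
        open my $out, '>:encoding(UTF-8)', $file
            or die "Cannot write $file: $!";
        # format => 'text' strips the markup; it needs HTML::TreeBuilder installed
        print {$out} $mech->content( format => 'text' );
        close $out or die "Cannot close $file: $!";

        sleep $delay if $delay;    # be polite to the server
    }
    return $count;
}

# Run only when called directly with a start URL, for example:
#   perl save_entries.pl http://www.molmovdb.org/cgi-bin/browse.cgi
if ( !caller && @ARGV ) {
    # The pattern is a placeholder: adjust it to the hrefs in the page source
    my $saved = save_linked_pages( $ARGV[0], qr/\.cgi\?/, '.', 1 );
    print "Saved $saved pages\n";
}
```

Collecting the links before the loop keeps the list stable while the Mechanize object moves from page to page, and the one-second delay keeps the script from hammering the server.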

Node Type: perlquestion [id://930488]
Approved by planetscape