CGI to query other websites

by Shuraski (Scribe)
on Apr 19, 2010 at 23:50 UTC ( [id://835609] )

Shuraski has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

Does anyone have a preferred way to query a website using CGI?
Or can you recommend a tutorial?

For example, I want to search multiple websites for information about a specific gene, whose name is obtained from a user input form:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title> Basic Web Form </title>
</head>
<body>
  <h1> Basic Web Form Example </h1>
  <p>
    Please enter a gene name and click 'Search' to get a report of
    information for a gene.
  </p>
  <form action="/cgi-bin/gene_query.pl" method="get">
    <p>
      Gene: <input type="text" name="gene" size="15" />
      <input type="submit" value="Search" />
      <input type="reset" value="Clear" />
    </p>
  </form>
</body>
</html>

In gene_query.pl, what is the best technique to search a website like PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), for example, for all articles involving $gene?

I assume these websites have SQL databases as the back end. Does one need to construct an SQL query against their database?

Thanks in advance for any pointers in the right direction.

Replies are listed 'Best First'.
Re: CGI to query other websites
by Marshall (Canon) on Apr 20, 2010 at 00:29 UTC
    In gene_query.pl what is the best technique to search (for example), a website like Pubmed (http://www.ncbi.nlm.nih.gov/pubmed/) for all articles involving $gene?

    When accessing a huge database like Entrez, the first thing to do is look on their site for tools they have written to help you out. NCBI knows there are automated clients accessing their site, and they have built things to help you, which in turn keeps their site from being swamped. I would also pay attention to their "rules of the polite road", e.g. run heavy loads after business hours EST.

    Take a look at Entrez Programming Utilities. These folks have a special URL for automated queries, and they want you to use it rather than "scraping" their normal, human-oriented pages. Read the previous link carefully and look at their Perl code; I think you will have a lot of success there. Of course, also look at their PubMed Home Page.
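
    For a flavour of what an E-utilities call can look like from gene_query.pl, here is a minimal sketch using the ESearch endpoint with its db and term parameters; treat the exact URL and any extra parameters NCBI asks for (such as tool and email) as things to confirm against their documentation.

    #!/usr/bin/perl
    # Sketch only: ask PubMed for article IDs via NCBI's ESearch utility
    # instead of scraping the human-facing pages.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $gene = 'BRCA1';    # in gene_query.pl this would come from the form

    my $url = URI->new('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi');
    $url->query_form(
        db   => 'pubmed',
        term => $gene,
    );

    my $ua  = LWP::UserAgent->new( agent => 'gene_query/0.1' );
    my $res = $ua->get($url);
    die 'ESearch request failed: ', $res->status_line unless $res->is_success;

    # The response is XML with a <Count> and a list of PubMed <Id> elements;
    # in real code, parse it with an XML module such as XML::LibXML.
    print $res->decoded_content;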

Re: CGI to query other websites
by pemungkah (Priest) on Apr 20, 2010 at 00:26 UTC
    First - let me clear up a possible misconception.

    When you talk to another website, you're generally going to have to use whatever interface they've supplied; you can't query their database directly. As an example, Yahoo! stores its data in a complex proprietary system, but you simply use http://search.yahoo.com/search?p=cats to search for cats. You don't (and can't) query their database. This is for reasons of both security and complexity: letting an arbitrary person run queries against your database means you have to manage access and prevent things like DROP TABLE from happening, and keeping the database behind an interface lets it be changed around without you having to care what it is or how it works.
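
    To make that concrete, here is a small sketch of hitting a search URL like the Yahoo! one above; the only site-specific part is the p parameter shown in that URL, and the rest is plain LWP.

    # Sketch: use a site's public search URL, exactly as a browser would,
    # rather than touching its database.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $url = URI->new('http://search.yahoo.com/search');
    $url->query_form( p => 'cats' );    # same as typing ?p=cats by hand

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($url);

    if ( $res->is_success ) {
        my $html = $res->decoded_content;    # the result page, ready for parsing
        print length($html), " bytes of HTML returned\n";
    }
    else {
        warn 'Request failed: ', $res->status_line, "\n";
    }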

    You can probably do what you want to do, but you'll need to go to each site's search function and see how it's done - that is, what URL the search form lives at and what data needs to be supplied to do a search. You may find WWW::Mechanize very helpful, as it acts like a browser that you control via Perl: you can fill out a form and submit it, get the resulting HTML back, and then parse out the search results.
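
    If it helps, here is a rough WWW::Mechanize sketch; the page URL, the form_number, and the field name q are placeholders for whatever the target site's search form actually uses.

    # Sketch: drive a site's own search form with WWW::Mechanize.
    # The URL, form number, and field name below are hypothetical; inspect the
    # target site's HTML to find the real ones.
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $gene = 'BRCA1';

    my $mech = WWW::Mechanize->new( autocheck => 1 );    # dies on HTTP errors
    $mech->get('http://example.org/search');             # page holding the form

    $mech->submit_form(
        form_number => 1,
        fields      => { q => $gene },
    );

    my $html = $mech->content;    # the results page, ready to be parsed
    print length($html), " bytes of results HTML\n";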

Re: CGI to query other websites
by tospo (Hermit) on Apr 20, 2010 at 11:11 UTC

    As Marshall pointed out, you should use the Entrez Programming Utilities.

    Entrez offers a SOAP web service. You will find lots of resources out there that use the SOAP protocol. The other type of service you will frequently encounter is REST. It is well worth learning a bit about those two techniques - SOAP comes with a bit of a learning curve, whereas REST is more intuitive.

    For SOAP, you should have a look at SOAP::Lite. To use a SOAP service, you need a description of what information you can send and retrieve and in what form; this is known as a "WSDL" (see below). To query REST services, you usually use something like LWP::Simple.
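
    A very small SOAP::Lite sketch, purely to show the shape of a WSDL-driven call; the WSDL URL and the search_articles operation below are hypothetical placeholders for whatever the real service defines.

    # Sketch: call a SOAP service described by a WSDL using SOAP::Lite.
    # The WSDL URL and the operation name are made up for illustration.
    use strict;
    use warnings;
    use SOAP::Lite;

    my $service = SOAP::Lite->service('http://example.org/gene_service.wsdl');

    # SOAP::Lite builds proxy methods from the WSDL, so this call only works
    # if the WSDL actually defines an operation with this name.
    my $result = $service->search_articles('BRCA1');

    print "Got: $result\n";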

    A very useful resource for your purposes is http://www.biocatalogue.org, which lists available web services, tells you something about the APIs they use, and gives you links to the WSDLs. That would be a good place to search for relevant sources of information for your web tool.

Re: CGI to query other websites
by bradcathey (Prior) on Apr 20, 2010 at 00:34 UTC

    I use Google for searching multiple Web sites.

    Sorry, couldn't resist.

    Seriously, unless they offer an API of some kind, WWW::Mechanize may be your best bet other than scraping pages or creating your own spider.

    —Brad
    "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re: CGI to query other websites
by furry_marmot (Pilgrim) on Apr 20, 2010 at 23:16 UTC

    Something that didn't get mentioned is that CGI runs on the server side. The form in your HTML above points at /cgi-bin/gene_query.pl, which is code that gets executed on the server the web page came from. It can't execute anywhere else.

    If you want to run a server and create a database and let other people use it, then use CGI. But if you're trying to access someone else's information, forget about CGI.

    Depending on what you want to do, LWP::Simple works fine for many simple and even some complex websites. Not everything requires an API.
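
    For the LWP::Simple route, a minimal sketch (the URL is just a placeholder; point it at whatever page or query URL you actually need):

    # Sketch: fetch a page with LWP::Simple and pull something out of it.
    use strict;
    use warnings;
    use LWP::Simple;

    my $url  = 'http://example.org/some/page';    # placeholder URL
    my $html = get($url)
        or die "Could not fetch $url\n";

    # Crude extraction for illustration; a real script would use an HTML
    # parser such as HTML::TokeParser or HTML::TreeBuilder.
    my ($title) = $html =~ m{<title>(.*?)</title>}is;
    print 'Page title: ', ( defined $title ? $title : '(none)' ), "\n";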

      The OP said that the goal is to make a website that allows users to gather data from other databases. So both CGI and a way of interacting with the third-party sites are required.

        I can see that now. On first reading, I thought he was confusing front- and back-end.
