Retrieving web pages with the LWP::UserAgent

by mrguy123 (Hermit)
on Sep 06, 2006 at 13:08 UTC

mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks
I am trying to retrieve a certain web page of search results with LWP::UserAgent. I took the query URL from this web site after conducting a search for cats and changing the form method from POST to GET, using Firefox. This is the code:
#!/exlibris/metalib/m4_b/product/bin/perl
use strict;
use LWP::UserAgent;

{
    my $ua = new LWP::UserAgent();
    my $search_address = "http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?";
    my $post = 'qAllWords=cats&qAnyWords=&qNoWords=&PostedSince=01%2F01%2F2003&webcat_select=All&databases=ATL&databases=AGWORLD&databases=MRD_CCG&databases=DLA&databases=CBD&databases=MRD_ISA&databases=MRD_IMI&databases=MISCFILES&databases=MRD_ALL&databases=MRD_MDB&databases=NED&databases=PUB&databases=ONLINE&databases=TOP&databases=ETO_DE&databases=ETO_OF&configserver=CN%3Dstatweb01%2FOU%3Dwebserv%2FOU%3Dstatesa%2FO%3Dstatdoc&configpath=nct_config5.nsf&webcategories=All&header=&footer=&disp_header=&disp_footer=&saveoptions=0&query=AND+%28%5BdUpdate%5D+%3E+01%2F01%2F2003%29';

    # creating the request object
    my $header = new HTTP::Headers();
    my $req    = new HTTP::Request('POST', $search_address, $header, $post);

    # sending the request
    my $res = $ua->request($req);
    if (!$res->is_success) {
        warn "Warning: " . $res->message . "\n";
    }
    print $res->as_string . "\n";
}
However, when I run this program, instead of getting the search results, I get the search page. Usually when this happens it is because of a cookie, but I couldn't find any cookies on this site. Does anybody know what the problem might be, and how I can fix it so that I get the search results page?
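In case a cookie is involved after all, here is a minimal sketch (not my original code) of the same POST with an in-memory cookie jar enabled, so that any session cookies the server sets would be sent back automatically. The shortened query string and the extra GET of the search page are just for illustration:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Enable an in-memory cookie jar so session cookies are kept between requests.
my $ua = LWP::UserAgent->new(
    cookie_jar => HTTP::Cookies->new(),
);

my $search_address = "http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?";
my $post           = 'qAllWords=cats&qAnyWords=&qNoWords=';   # shortened; use the full query string

# Fetch the search page first so the server has a chance to set cookies ...
$ua->get($search_address);

# ... then send the POST; the cookie jar returns any cookies automatically.
my $res = $ua->post($search_address, Content => $post);
warn "Warning: " . $res->message . "\n" unless $res->is_success;
print $res->as_string, "\n";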
Much obliged
Guy Naamati

"A truth that's told with bad intent beats all then lies you can invent"

Replies are listed 'Best First'.
Re: Retrieving web pages with the LWP::UserAgent
by b10m (Vicar) on Sep 06, 2006 at 13:26 UTC

    When I convert the POST to a GET, I get an error:

    Error 400: 
    
    HTTP Web Server: Unknown Command Exception

    So my guess is that they really want you to POST your data.

    For such tasks, WWW::Mechanize is usually my preferred choice, for it makes stuff so easy. A sample script like this would get you started:

    use strict;
    use WWW::Mechanize;

    my $mech = new WWW::Mechanize;
    $mech->get('http://www.stat-usa.gov/nct_all.nsf/Search');
    $mech->submit_form(
        form_name => '_Search',
        fields    => { Query => 'your search term' },
    );
    print $mech->content;
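    If the form name or field names turn out to differ from the '_Search' and 'Query' used above, WWW::Mechanize can list what the page actually contains. A quick sketch:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.stat-usa.gov/nct_all.nsf/Search');

    # Dump every form on the page, with its inputs, so you can see the real
    # form and field names before calling submit_form().
    for my $form ($mech->forms) {
        print "Form: ", ($form->attr('name') || '(unnamed)'),
              " -> ", $form->action, "\n";
        print "  input: ", $_->name, "\n"
            for grep { defined $_->name } $form->inputs;
    }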
    --
    b10m

    All code is usually tested, but rarely trusted.
      Thanks for the advice. Does it work similarly to LWP::UserAgent?

        It uses LWP::UserAgent, yes, only (as you can see) it's a lot easier to work with. The code above is basically all you need ;-)

        --
        b10m

        All code is usually tested, but rarely trusted.
Re: Retrieving web pages with the LWP::UserAgent
by davorg (Chancellor) on Sep 06, 2006 at 13:15 UTC

    Could be any number of things. My two best guesses would be:

    • Maybe the form processor only accepts POSTs. Why not try POSTing the request instead?
    • The "2d58b7a34bbaa3838525703f004f804e" part of your URL looks like it might be a session ID. Perhaps that session has expired.

    Another useful tip in situations like this is to install Firefox's LiveHTTPHeaders extension and watch exactly what the HTTP interaction looks like. You might be missing important headers.
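    If LiveHTTPHeaders shows headers that the browser sends but the script does not, they can be added to the LWP::UserAgent object. A minimal sketch; the User-Agent string and Referer shown here are only placeholders, not values taken from the site:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();

    # Mimic whatever the browser sends, as observed in LiveHTTPHeaders.
    $ua->agent('Mozilla/5.0');                                        # placeholder User-Agent
    $ua->default_header(
        'Referer' => 'http://www.stat-usa.gov/nct_all.nsf/Search',   # assumed referring page
    );

    my $res = $ua->post(
        'http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?',
        Content => 'qAllWords=cats',   # shortened query string
    );
    print $res->status_line, "\n";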

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Your second idea, about the session ID, was worth pursuing. I tried the URL manually and got a search page. Then I removed the "session ID" and got a page with just two links: one to a plain search page and one to an advanced search page. Apparently the latter is the one the OP has been using, and its canonical URL is http://www.stat-usa.gov/nct_all.nsf/advSearch.

      When I looked at that page's source, the form's action attribute was /nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?CreateDocument: the exact same strange ID. So apparently it isn't variable; it was most likely generated by their web site creation tool.

      Do note the part after the question mark: "CreateDocument". I suggest the OP try a POST with this part appended; as one would expect, it doesn't work with GET.

      I did try the OP's code as posted, with just this change (and the wrapped lines reassembled), and it works for me.
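      In other words, a sketch of the change: only the target URL differs from the OP's code, and the query string is shortened here for readability.

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new();

      # Same POST as the OP's code, but aimed at the form's real action:
      # the strange ID plus "?CreateDocument".
      my $url  = 'http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?CreateDocument';
      my $post = 'qAllWords=cats&qAnyWords=&qNoWords=';   # use the full query string from above

      my $res = $ua->post($url, Content => $post);
      warn "Warning: " . $res->message . "\n" unless $res->is_success;
      print $res->decoded_content, "\n";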

      Sorry about the 'GET', it should have been 'POST', although the result is the same.
      I tried it with newer session IDs and got the same result.
      I will follow your advice about the HTTP headers.
      Do you know of any other way a website can store state besides cookies and session IDs?
Re: Retrieving web pages with the LWP::UserAgent
by Fletch (Bishop) on Sep 06, 2006 at 13:18 UTC

    I'm going to guess that that long hex string is some sort of session identifier, and that you're probably missing a temporary cookie that's set to expire when your browser quits, or perhaps coming from a different machine than the one the session was started from.

    In either case, you're probably going to be better off using WWW::Mechanize to go through the site's login page and then navigate to whatever data you're trying to retrieve. Or see if they have some sort of SOAP/XMLRPC/REST-y interface for queries that you might be able to hit directly.
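    A rough sketch of that approach follows. The login URL, field names, and link text here are all assumptions for illustration, not taken from the site:

    use strict;
    use warnings;
    use WWW::Mechanize;

    # WWW::Mechanize keeps an in-memory cookie jar by default, so the session
    # established by the login survives across the later requests.
    my $mech = WWW::Mechanize->new();

    # Hypothetical login page and field names; adjust to what the site really uses.
    $mech->get('http://www.stat-usa.gov/login');
    $mech->submit_form(
        with_fields => { username => 'me', password => 'secret' },
    );

    # Then navigate to the search page and run the query as in the earlier sample.
    $mech->follow_link( text_regex => qr/advanced search/i );
    print $mech->content;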

Re: Retrieving web pages with the LWP::UserAgent
by planetscape (Chancellor) on Sep 07, 2006 at 16:17 UTC
      Thank you for pointing out HTTP::Recorder! It looks like exactly the solution to a problem I've been having: getting a script to collect multiple pages out of some search results. Thank you!
