Retrieving web pages with the LWP::UserAgent

by mrguy123 (Hermit)
on Sep 06, 2006 at 13:08 UTC

mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks
I am trying to retrieve a certain web page of search results with LWP::UserAgent. I took the query URL from this web site after conducting a search for cats and changing the form method from POST to GET, using Firefox. This is the code:
#!/exlibris/metalib/m4_b/product/bin/perl
use strict;
use LWP::UserAgent;

{
    my $ua = new LWP::UserAgent();
    my $search_address = "http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?";
    my $post = 'qAllWords=cats&qAnyWords=&qNoWords=&PostedSince=01%2F01%2F2003&webcat_select=All&databases=ATL&databases=AGWORLD&databases=MRD_CCG&databases=DLA&databases=CBD&databases=MRD_ISA&databases=MRD_IMI&databases=MISCFILES&databases=MRD_ALL&databases=MRD_MDB&databases=NED&databases=PUB&databases=ONLINE&databases=TOP&databases=ETO_DE&databases=ETO_OF&configserver=CN%3Dstatweb01%2FOU%3Dwebserv%2FOU%3Dstatesa%2FO%3Dstatdoc&configpath=nct_config5.nsf&webcategories=All&header=&footer=&disp_header=&disp_footer=&saveoptions=0&query=AND+%28%5BdUpdate%5D+%3E+01%2F01%2F2003%29';

    # creating the request object
    my $header = new HTTP::Headers();
    my $req    = new HTTP::Request('POST', $search_address, $header, $post);

    # sending the request
    my $res = $ua->request($req);
    if (!$res->is_success) {
        warn "Warning: " . $res->message . "\n";
    }
    print $res->as_string . "\n";
}
However, when I run this program, instead of getting the search results, I get the search page. Usually when this happens it is because of a cookie, but I couldn't find any cookies on this site. Does anybody know what the problem might be, and how I can fix it so that I get the search results page?
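In case a cookie is involved after all, here is a minimal sketch (not my original code) of the same POST with an in-memory cookie jar enabled, so that any session cookies the server sets would be sent back automatically. The shortened query string and the extra GET of the search page are just for illustration:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Enable an in-memory cookie jar so session cookies are kept between requests.
my $ua = LWP::UserAgent->new(
    cookie_jar => HTTP::Cookies->new(),
);

my $search_address = "http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?";
my $post           = 'qAllWords=cats&qAnyWords=&qNoWords=';   # shortened; use the full query string

# Fetch the search page first so the server has a chance to set cookies ...
$ua->get($search_address);

# ... then send the POST; the cookie jar returns any cookies automatically.
my $res = $ua->post($search_address, Content => $post);
warn "Warning: " . $res->message . "\n" unless $res->is_success;
print $res->as_string, "\n";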
Much obliged
Guy Naamati

"A truth that's told with bad intent beats all then lies you can invent"

Replies are listed 'Best First'.
Re: Retrieving web pages with the LWP::UserAgent
by b10m (Vicar) on Sep 06, 2006 at 13:26 UTC

    When I convert the POST to a GET, I get an error:

    Error 400: 
    
    HTTP Web Server: Unknown Command Exception

    So my guess is that they really want you to POST your data.

    For such tasks, WWW::Mechanize is usually my preferred choice, for it makes stuff so easy. A sample script like this would get you started:

    use strict;
    use WWW::Mechanize;

    my $mech = new WWW::Mechanize;
    $mech->get('http://www.stat-usa.gov/nct_all.nsf/Search');
    $mech->submit_form(
        form_name => '_Search',
        fields    => { Query => 'your search term' },
    );
    print $mech->content;
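    If the form name or field names turn out to differ from the '_Search' and 'Query' used above, WWW::Mechanize can list what the page actually contains. A quick sketch:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.stat-usa.gov/nct_all.nsf/Search');

    # Dump every form on the page, with its inputs, so you can see the real
    # form and field names before calling submit_form().
    for my $form ($mech->forms) {
        print "Form: ", ($form->attr('name') || '(unnamed)'),
              " -> ", $form->action, "\n";
        print "  input: ", $_->name, "\n"
            for grep { defined $_->name } $form->inputs;
    }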
    --
    b10m

    All code is usually tested, but rarely trusted.
      Thanks for the advice. Does it work similarly to LWP::UserAgent?

        It uses LWP::UserAgent, yes, only (as you can see) it's a lot easier to work with. The code above is basically all you need ;-)

        --
        b10m

        All code is usually tested, but rarely trusted.
Re: Retrieving web pages with the LWP::UserAgent
by davorg (Chancellor) on Sep 06, 2006 at 13:15 UTC

    Could be any number of things. My two best guesses would be:

    • Maybe the form processor only accepts POSTs. Why not try POSTing the request instead?
    • The "2d58b7a34bbaa3838525703f004f804e" part of your URL looks like it might be a session ID. Perhaps that session has expired.

    Another useful tip in situations like this is to install Firefox's LiveHTTPHeaders extension and watch exactly what the HTTP interaction looks like. You might be missing important headers.
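    If LiveHTTPHeaders shows headers that the browser sends but the script does not, they can be added to the LWP::UserAgent object. A minimal sketch; the User-Agent string and Referer shown here are only placeholders, not values taken from the site:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();

    # Mimic whatever the browser sends, as observed in LiveHTTPHeaders.
    $ua->agent('Mozilla/5.0');                                        # placeholder User-Agent
    $ua->default_header(
        'Referer' => 'http://www.stat-usa.gov/nct_all.nsf/Search',   # assumed referring page
    );

    my $res = $ua->post(
        'http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?',
        Content => 'qAllWords=cats',   # shortened query string
    );
    print $res->status_line, "\n";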

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Your second idea, about the session ID, was worth pursuing. I tried the URL manually and got a search page. Then I removed the "session ID" and got a page with just two links: one to a plain search page and one to an advanced search page. Apparently the latter is the one the OP has been using, and its canonical URL is http://www.stat-usa.gov/nct_all.nsf/advSearch.

      When I looked at that page's source, the form's action attribute was /nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?CreateDocument: the exact same strange ID. So apparently it isn't variable; it was most likely generated by their web site creation tool.

      Do note the part after the question mark: "CreateDocument". I suggest the OP try a POST with this part appended; as one would expect, it doesn't work with GET.

      I did try the OP's code as posted, with just this change (and the wrapped lines reassembled), and it works for me.
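      In other words, a sketch of the change: only the target URL differs from the OP's code, and the query string is shortened here for readability.

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new();

      # Same POST as the OP's code, but aimed at the form's real action:
      # the strange ID plus "?CreateDocument".
      my $url  = 'http://www.stat-usa.gov/nct_all.nsf/2d58b7a34bbaa3838525703f004f804e?CreateDocument';
      my $post = 'qAllWords=cats&qAnyWords=&qNoWords=';   # use the full query string from above

      my $res = $ua->post($url, Content => $post);
      warn "Warning: " . $res->message . "\n" unless $res->is_success;
      print $res->decoded_content, "\n";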

      Sorry about the 'GET', it should have been 'POST', although the result is the same.
      I tried it with newer session IDs and got the same result.
      I will follow your advice about the HTTP headers.
      Do you know of any other way a website can store state besides cookies and session IDs?
Re: Retrieving web pages with the LWP::UserAgent
by Fletch (Bishop) on Sep 06, 2006 at 13:18 UTC

    I'm going to guess that that long hex string is some sort of session identifier, and that you're probably missing a temporary cookie that's set to expire when your browser quits, or perhaps coming from a different machine than the one the session was started from.

    In either case, you're probably going to be better off using WWW::Mechanize to go through the site's login page and then navigate to whatever data you're trying to retrieve. Or see if they have some sort of SOAP/XMLRPC/REST-y interface for queries that you might be able to hit directly.
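    A rough sketch of that approach follows. The login URL, field names, and link text here are all assumptions for illustration, not taken from the site:

    use strict;
    use warnings;
    use WWW::Mechanize;

    # WWW::Mechanize keeps an in-memory cookie jar by default, so the session
    # established by the login survives across the later requests.
    my $mech = WWW::Mechanize->new();

    # Hypothetical login page and field names; adjust to what the site really uses.
    $mech->get('http://www.stat-usa.gov/login');
    $mech->submit_form(
        with_fields => { username => 'me', password => 'secret' },
    );

    # Then navigate to the search page and run the query as in the earlier sample.
    $mech->follow_link( text_regex => qr/advanced search/i );
    print $mech->content;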

Re: Retrieving web pages with the LWP::UserAgent
by planetscape (Chancellor) on Sep 07, 2006 at 16:17 UTC
      Thank you for pointing out HTTP::Recorder! It looks like exactly the solution to a problem I've been having: getting a script to collect multiple pages out of some search results. Thank you!
