http://qs321.pair.com?node_id=909837


in reply to Parsing HTTP...

Ask yourself what is different about the two requests: the one from your browser and the one from your Perl code. There are two common classes of difference:

  1. Differences in the request.
  2. Differences in the processing of the response document.

For (1), remember that the request is much more than the URL: your browser may send a number of headers. Headers that commonly change behaviour include Cookie, User-Agent and Referer, but any header can matter. You can inspect the headers by sniffing the network (Wireshark), with a browser plugin (e.g. Firebug for Firefox) or through a proxy (Fiddler, on Windows). LWP (if that is what you are using) allows you to set the headers of your request.
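As a minimal sketch of that last point, here is how you might build an LWP request that reproduces the headers your browser sent. The URL and header values below are placeholders; substitute whatever you actually see in Wireshark/Firebug/Fiddler.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical URL and header values -- copy the real ones from
# your sniffer/proxy output.
my $req = HTTP::Request->new( GET => 'http://www.example.com/data' );
$req->header( 'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1)' );
$req->header( 'Referer'    => 'http://www.example.com/' );
$req->header( 'Cookie'     => 'session=abc123' );

print $req->as_string;    # inspect exactly what will be sent

my $ua = LWP::UserAgent->new;
# my $res = $ua->request($req);         # uncomment to actually send it
# print $res->status_line, "\n";
```

Comparing `$req->as_string` side by side with the sniffed browser request is often enough to spot the header that makes the difference.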

For (2), the culprit is usually Javascript. The commonly used Perl tools, LWP and its derivatives (e.g. WWW::Mechanize), do not support Javascript. In most cases you can read the Javascript yourself and manually mimic what it does with further requests or Perl code. There are also some Perl modules on CPAN that claim Javascript capabilities, usually by driving a conventional browser; have a look there. You could also look at Selenium.
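A sketch of the "read the Javascript and mimic it" approach, using hypothetical page and variable names: suppose the page's inline script builds a second URL and fetches it via XMLHttpRequest. You can extract that URL from the HTML and request it directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://www.example.com/page');    # hypothetical URL

# Suppose the inline script contains something like
#     var dataUrl = '/ajax/data?id=42';
# Pull that URL out of the HTML and issue the request ourselves,
# which is exactly what the browser's Javascript would have done.
if ( $mech->content =~ /dataUrl\s*=\s*'([^']+)'/ ) {
    $mech->get($1);
    print $mech->content;
}
```

The pattern and URL here are invented for illustration; the point is that the "missing" data usually arrives via a plain HTTP request you can replay yourself.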

Finally, think laterally: perhaps you can get your data another way. The website you mention seems to have various XML feeds.
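If a feed exists, fetching and parsing it is usually far easier than scraping HTML. A sketch, assuming a hypothetical feed URL and an RSS-style `item`/`title` structure (the real feed addresses and element names will differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;    # or XML::Twig, XML::Simple, ...

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://www.example.com/feed.xml');   # hypothetical URL
die $res->status_line unless $res->is_success;

# Parse the feed and print one field per entry.
my $doc = XML::LibXML->load_xml( string => $res->decoded_content );
for my $item ( $doc->findnodes('//item') ) {
    print $item->findvalue('title'), "\n";
}
```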

Re^2: Parsing HTTP...
by insectopalo (Initiate) on Jun 19, 2011 at 23:01 UTC

    Thank you all. I have been trying some of the alternatives, but it seems pretty difficult in general. However, using WWW::Mechanize::Firefox has, in practical terms, solved the issue. It looks gross, though, to see the browser doing the dirty work.