PerlMonks  

Scrape "generated" content from secure site?

by hackerkatt (Initiate)
on Apr 26, 2009 at 20:10 UTC [id://760180]

hackerkatt has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow coders. I'm somewhat new to working with Perl. I am attempting to automate pulling some information from a wholesale provider's website that does not offer an API for its retailers, i.e. to scrape some information from a page. I can successfully log in to the https site and navigate to the page which should contain the content, but the "generated content" is missing. So as not to draw a false conclusion, I verified that there is content: I logged into the site via the browser, and indeed there is content, and I can "view source" and see the tabled information I ultimately want to get at via a script. Here is the code I use in my attempt. I've also played around with LWP::UserAgent, with less success. For obvious reasons, the site, user, and password have been changed. I am asking for any assistance from anyone with much more experience with this. Thank you very much.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;

my $outfile  = "out.htm";
my $url      = "https://www.mywholesalersite.com/frmLogin.aspx";
my $username = 'user';
my $password = 'passwd';

my $mech = WWW::Mechanize->new();
$mech->cookie_jar(HTTP::Cookies->new());
$mech->get($url);
$mech->form_name('Form1');
$mech->field(login => $username);
$mech->field(passwd => $password);
$mech->click();
sleep 3;
# We are now logged in. This is verified by viewing $mech->content

$url = 'https://www.mywholesalersite.com/frmReportLineCount.aspx';
$mech->get($url);
sleep 15;    # Allow time for content to be generated
$mech->save_content($outfile);
exit;

Replies are listed 'Best First'.
Re: Scrape "generated" content from secure site?
by bart (Canon) on Apr 26, 2009 at 20:18 UTC
    I can successfully login to the https site and navigate to the page which should contain the content, but the "generated content" is missing.
    You sure it's not added in Javascript?

    You can test that by disabling Javascript in your browser (easy to do in Firefox if you have the NoScript extension) and seeing whether the content is still there or now missing.

      @bart - I tried the site w/o Javascript enabled. I indeed get the content.
Re: Scrape "generated" content from secure site?
by igelkott (Priest) on Apr 26, 2009 at 20:26 UTC
    username ... password

    Is authentication really on a simple form, or is this a popup from the webserver? See Authorization in the Mechanize FAQ for help.
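
    If it did turn out to be webserver (HTTP Basic/Digest) authentication rather than a form, Mechanize takes credentials on the agent instead of form fields. A minimal sketch, assuming a recent WWW::Mechanize and using the placeholder host, user, and password from the question:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();

    # For the browser-popup case, supply credentials to the agent;
    # Mechanize will answer HTTP auth challenges with them.
    $mech->credentials( 'user', 'passwd' );

    $mech->get('https://www.mywholesalersite.com/frmReportLineCount.aspx');
    ```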

      @igelkott, No popup, i.e. simple: it is two form elements submitted. If I log in via my method, I do get access and can $mech->get other pages from the site that are otherwise unavailable.
Re: Scrape "generated" content from secure site?
by Llew_Llaw_Gyffes (Scribe) on Apr 27, 2009 at 04:41 UTC
    Hackerkatt, have you tried using different User-Agent strings? Some sites will only deliver content to known browser user-agents as a measure to prevent robotic browsing or spidering.
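
    A sketch of trying a different User-Agent with WWW::Mechanize: the string can be set at construction time, or agent_alias() can pick one of Mechanize's built-in browser strings (the exact UA string below is just an example):

    ```perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # Pass an explicit User-Agent string at construction time...
    my $mech = WWW::Mechanize->new(
        agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)',
    );

    # ...or use one of Mechanize's known browser aliases instead.
    $mech->agent_alias('Windows Mozilla');
    ```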
Re: Scrape "generated" content from secure site?
by Gangabass (Vicar) on Apr 27, 2009 at 07:59 UTC

    Some sites use dirty tricks like setting cookies when you GET some picture (or CSS file). So you should investigate the server responses, with LiveHTTPHeaders for example.

    Also, some sites expect x and y coordinates when you click a button...

      @Gangabass, How can I be sure that the cookie I'm getting/saving is being used by Mech? I see this cookie getting stored in Firebug:
      151318053.1240690172.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
      And this script, snipped from the page with the link to my final destination, looks like it could have some bearing. I'm trying to save the minified JS for inspection, but am having a problem getting all of it. Any thoughts on what you see here? Perhaps some common code used?
      <script type="text/javascript">
      var pageTracker = _gat._getTracker("UA-3875979-1");
      pageTracker._initData();
      pageTracker._trackPageview();
      </script>
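
      One way to see exactly what Mech is holding is to dump the cookie jar, or keep it in a file so it can be inspected (and reused) between runs. A sketch, with the login URL from the question:

      ```perl
      use strict;
      use warnings;
      use WWW::Mechanize;
      use HTTP::Cookies;

      # Keep the jar in a file so it survives between runs and can be read.
      my $jar  = HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 );
      my $mech = WWW::Mechanize->new( cookie_jar => $jar );

      $mech->get('https://www.mywholesalersite.com/frmLogin.aspx');

      # Dump every cookie Mech currently has, for comparison with Firebug.
      print $mech->cookie_jar->as_string;
      ```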

        This is Google Analytics tracking code, so I don't think this is what you want...

        Earlier you said that this site works w/o JavaScript... Did you clear cookies before testing it?

        If it is JavaScript, then you just need to check where it sets cookies and implement that logic in your WWW::Mechanize.
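
        Replicating a JavaScript-set cookie by hand is straightforward with the jar's set_cookie method. A sketch, where the cookie name, value, and domain are placeholders standing in for whatever the page's script actually sets:

        ```perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();   # creates a cookie jar by default

        # Suppose the page's JavaScript does something like
        #   document.cookie = "hasJS=1; path=/";
        # Set the same cookie by hand before requesting the protected page.
        # Arguments: version, key, value, path, domain, port, path_spec,
        # secure, maxage, discard.
        $mech->cookie_jar->set_cookie(
            0, 'hasJS', '1', '/', '.mywholesalersite.com',
            undef, 0, 0, 86400, 0,
        );

        $mech->get('https://www.mywholesalersite.com/frmReportLineCount.aspx');
        ```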

Node Type: perlquestion [id://760180]
Approved by moritz