PerlMonks  

Scrape "generated" content from secure site?

by hackerkatt (Initiate)
on Apr 26, 2009 at 20:10 UTC [id://760180]

hackerkatt has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow coders. I'm somewhat new to working with Perl. I am attempting to automate pulling some information from a wholesale provider's website that does not offer an API for its retailers, i.e. to scrape some information from a page. I can successfully log in to the https site and navigate to the page which should contain the content, but the "generated content" is missing. So as not to draw a false conclusion, I verified that there is content: I logged into the site via the browser, and indeed there is content, and I can "view source" and see the tabled information I ultimately want to get at via a script. Here is the code I use in my attempt. I've also played around with LWP::UserAgent, with less success. For obvious reasons, the site, user, and password have been changed. I am asking for any assistance from anyone with much more experience with this. Thank you very much.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;

my $outfile  = "out.htm";
my $url      = "https://www.mywholesalersite.com/frmLogin.aspx";
my $username = 'user';
my $password = 'passwd';

my $mech = WWW::Mechanize->new();
$mech->cookie_jar(HTTP::Cookies->new());
$mech->get($url);
$mech->form_name('Form1');
$mech->field(login => $username);
$mech->field(passwd => $password);
$mech->click();
sleep 3;
# We are now logged in. This is verified by viewing $mech->content

$url = 'https://www.mywholesalersite.com/frmReportLineCount.aspx';
$mech->get($url);
sleep 15;    # Allow time for content to be generated
$mech->save_content($outfile);
exit;

Replies are listed 'Best First'.
Re: Scrape "generated" content from secure site?
by bart (Canon) on Apr 26, 2009 at 20:18 UTC
    I can successfully login to the https site and navigate to the page which should contain the content, but the "generated content" is missing.
    You sure it's not added in Javascript?

    You can test that by disabling Javascript in your browser (easy to do in Firefox if you have the NoScript extension) and seeing whether the content is still there or now missing.

      @bart - I tried the site w/o Javascript enabled. I indeed get the content.
Re: Scrape "generated" content from secure site?
by igelkott (Priest) on Apr 26, 2009 at 20:26 UTC
    username ... password

    Is authentication really on a simple form, or is this a popup from the webserver? See Authorization in the Mechanize FAQ for help.
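
    If it did turn out to be webserver (HTTP Basic/Digest) authentication rather than a form, Mechanize takes credentials on the agent instead of form fields. A minimal sketch, assuming a recent WWW::Mechanize and using the placeholder host, user, and password from the question:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();

    # For the browser-popup case, supply credentials to the agent;
    # Mechanize will answer HTTP auth challenges with them.
    $mech->credentials( 'user', 'passwd' );

    $mech->get('https://www.mywholesalersite.com/frmReportLineCount.aspx');
    ```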

      @igelkott, No popup, i.e. simple: it is two form elements submitted. If I log in via my method, I do get access and can $mech->get other pages from the site that are otherwise unavailable.
Re: Scrape "generated" content from secure site?
by Llew_Llaw_Gyffes (Scribe) on Apr 27, 2009 at 04:41 UTC
    Hackerkatt, have you tried using different User-Agent strings? Some sites will only deliver content to known browser user-agents as a measure to prevent robotic browsing or spidering.
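
    A sketch of trying a different User-Agent with WWW::Mechanize: the string can be set at construction time, or agent_alias() can pick one of Mechanize's built-in browser strings (the exact UA string below is just an example):

    ```perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # Pass an explicit User-Agent string at construction time...
    my $mech = WWW::Mechanize->new(
        agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)',
    );

    # ...or use one of Mechanize's known browser aliases instead.
    $mech->agent_alias('Windows Mozilla');
    ```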
Re: Scrape "generated" content from secure site?
by Gangabass (Vicar) on Apr 27, 2009 at 07:59 UTC

    Some sites use dirty tricks like setting cookies when you GET some picture (or CSS file). So you should investigate the server responses, with LiveHTTPHeaders for example.

    Also, some sites expect x and y coordinates when you click a button...

      @Gangabass, How can I be sure that the cookie I'm getting/saving is being used by Mech? I see this cookie getting stored in Firebug:
      151318053.1240690172.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
      And this script, snipped from the page with the link to my final destination, looks like it could have some bearing. I'm trying to save the minified JS for inspection, but am having a problem getting all of it. Any thoughts on what you see here? Perhaps some common code used?
      <script type="text/javascript">
      var pageTracker = _gat._getTracker("UA-3875979-1");
      pageTracker._initData();
      pageTracker._trackPageview();
      </script>
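
      One way to see exactly what Mech is holding is to dump the cookie jar, or keep it in a file so it can be inspected (and reused) between runs. A sketch, with the login URL from the question:

      ```perl
      use strict;
      use warnings;
      use WWW::Mechanize;
      use HTTP::Cookies;

      # Keep the jar in a file so it survives between runs and can be read.
      my $jar  = HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 );
      my $mech = WWW::Mechanize->new( cookie_jar => $jar );

      $mech->get('https://www.mywholesalersite.com/frmLogin.aspx');

      # Dump every cookie Mech currently has, for comparison with Firebug.
      print $mech->cookie_jar->as_string;
      ```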

        This is Google Analytics tracking code, so I don't think this is what you want...

        Earlier you said that this site works w/o JavaScript... Did you clear cookies before testing it?

        If it is JavaScript, then you just need to check where it sets cookies and implement that logic in your WWW::Mechanize.
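
        Replicating a JavaScript-set cookie by hand is straightforward with the jar's set_cookie method. A sketch, where the cookie name, value, and domain are placeholders standing in for whatever the page's script actually sets:

        ```perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();   # creates a cookie jar by default

        # Suppose the page's JavaScript does something like
        #   document.cookie = "hasJS=1; path=/";
        # Set the same cookie by hand before requesting the protected page.
        # Arguments: version, key, value, path, domain, port, path_spec,
        # secure, maxage, discard.
        $mech->cookie_jar->set_cookie(
            0, 'hasJS', '1', '/', '.mywholesalersite.com',
            undef, 0, 0, 86400, 0,
        );

        $mech->get('https://www.mywholesalersite.com/frmReportLineCount.aspx');
        ```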

Node Type: perlquestion [id://760180]
Approved by moritz