Hello Monks,
I've developed some WWW::Mechanize scripts that work without any problems, but now I am trying to route these scripts through a proxy (Privoxy v3.0.5 Beta running on my local Linux machine) and I'm finding that for all HTTPS requests, I always get 500 response codes. If I remove the Privoxy proxy from my Perl script, everything works. If I keep the proxy and just go to HTTP sites, everything works. If I configure my browser to go through the proxy and load an HTTPS page, everything works.
So to summarize so far...
HTTP request from WWW::Mechanize -> Privoxy => works!
HTTPS request from browser -> Privoxy => works!
HTTPS request from WWW::Mechanize -> direct connection to internet => works!
HTTPS request from WWW::Mechanize -> Privoxy => does NOT work!
Looking at Privoxy's detailed log file, the first sign of things going wrong appears to be that WWW::Mechanize passes a GET request to the proxy. The browsers do not do this; they use CONNECT. I really don't know for sure if this is correct, since CONNECT isn't really specified in the W3 HTTP 1.1 spec that I Googled.
My hypothesis is that Firefox has got it right and that WWW::Mechanize is not smart enough to use CONNECT instead of GET when requesting HTTPS pages through a proxy.
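To make the difference concrete, here is a minimal sketch of the two request forms as I understand them from the logs further down. The helper names (absolute_uri_get, connect_request, visible) are my own and hypothetical, not part of LWP or WWW::Mechanize, and the script only builds and prints the request text; it sends nothing over the network.

```perl
use strict;
use warnings;

# Request in the style the LWP debug log shows: a plain GET with the
# full https:// URL in the request line, sent unencrypted to the proxy.
sub absolute_uri_get {
    my ($url, $host) = @_;
    return "GET $url HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "\r\n";
}

# Request in the style the Privoxy log shows for Firefox: CONNECT to
# host:port, asking the proxy to open a tunnel to the target server.
sub connect_request {
    my ($host, $port) = @_;
    return "CONNECT $host:$port HTTP/1.1\r\n"
         . "Host: $host\r\n"
         . "\r\n";
}

# Make the CRLF line endings visible when printing.
sub visible {
    my ($raw) = @_;
    (my $shown = $raw) =~ s/\r\n/\\r\\n\n/g;
    return $shown;
}

print visible(absolute_uri_get('https://login.yahoo.com', 'login.yahoo.com'));
print visible(connect_request('login.yahoo.com', 443));
```

If my reading is right, only the second form lets the client do the SSL handshake itself, because the proxy just relays bytes once the tunnel is established.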
My questions to the group are...
1) Does all of this sound right?
2) How would I force a CONNECT from either WWW::Mechanize or LWP in this circumstance? Nothing is mentioned in any of the docs I've seen. Grepping the code didn't reveal anything to me either.
Here's my code....
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP;
use LWP::DebugFile;    # writes the lwp_*.log debug file
require HTTP::Request;

sub main {
    # Persist cookies between runs
    my $cookie_jar = HTTP::Cookies->new(
        file         => 'cookies.dat',
        autosave     => 1,
        hide_cookie2 => 1,
    );

    my $bot = WWW::Mechanize->new;
    $bot->max_redirect(100);
    $bot->cookie_jar($cookie_jar);

    # Route both http and https requests through the local Privoxy instance
    $bot->proxy(['http', 'https'], 'http://192.168.250.11:8118/');
    $bot->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3');

    my $url = "https://login.yahoo.com";
    #my $url = "https://us.etrade.com";
    my $response = $bot->get($url);
    my $content  = $bot->content;
}

main();
Here is what the Privoxy log looks like when I use my Perl script...
Dec 09 22:53:30 Privoxy(b7f856c0) Info: Privoxy version 3.0.5
Dec 09 22:53:30 Privoxy(b7f856c0) Info: Program name: ./privoxy
Dec 09 22:53:30 Privoxy(b7f856c0) Info: Listening on port 8118 on IP address 192.168.250.11
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: New HTTP Request-Line: GET / HTTP/1.0
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: GET / HTTP/1.0
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: Accept-Encoding: identity
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: Host: login.yahoo.com
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: addh-unique: Host: login.yahoo.com
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: Adding: Connection: close
Dec 09 22:53:44 Privoxy(b7f84bb0) Request: login.yahoo.com/
Dec 09 22:53:44 Privoxy(b7f84bb0) Writing: �Dec 09 22:53:45 Privoxy(b7f84bb0) Writing: GET / HTTP/1.0
Accept-Encoding: identity
Host: login.yahoo.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Connection: close
Dec 09 22:53:47 Privoxy(b7f84bb0) Header: Adding: Connection: close
Dec 09 22:53:47 Privoxy(b7f84bb0) Writing: Connection: close
Here is the LWP debug information generated...
# LWP::DebugFile logging to lwp_457baef8_5876.log
# Time now: {1165733624} = Sat Dec 9 22:53:44 2006
LWP::UserAgent::new: ()
LWP::UserAgent::proxy: ARRAY(0x8ce0c98) http://192.168.250.11:8118/
LWP::UserAgent::proxy: http http://192.168.250.11:8118/
LWP::UserAgent::proxy: https http://192.168.250.11:8118/
LWP::UserAgent::request: ()
HTTP::Cookies::add_cookie_header: Checking login.yahoo.com for cookies
HTTP::Cookies::add_cookie_header: Checking .yahoo.com for cookies
HTTP::Cookies::add_cookie_header: Checking yahoo.com for cookies
HTTP::Cookies::add_cookie_header: Checking .com for cookies
LWP::UserAgent::send_request: GET https://login.yahoo.com
LWP::UserAgent::_need_proxy: Proxied to http://192.168.250.11:8118/
LWP::Protocol::http10::request: ()
LWP::Protocol::http10::request: S>0 "GET https://login.yahoo.com HTTP/1.0\x0D\x0A"
LWP::Protocol::http10::request: S>+ "Accept-Encoding: identity\x0D\x0A"
LWP::Protocol::http10::request: S>+ "Host: login.yahoo.com\x0D\x0A"
LWP::Protocol::http10::request: S>+ "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.
8.0.3) Gecko/20060426 Firefox/1.5.0.3\x0D\x0A\x0D\x0A"
LWP::Protocol::http10::request: reading response
# Time now: {1165733627} = Sat Dec 9 22:53:47 2006
LWP::Protocol::http10::request: S>0 "Connection: close\x0D\x0A\x0D\x0A"
LWP::Protocol::http10::request: HTTP/0.9 assume OK
LWP::Protocol::collect: read 21 bytes
LWP::UserAgent::request: Simple response: OK
Here is what the Privoxy log file looks like when a browser (Firefox in this case) requests Yahoo's login page through the proxy...
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: CONNECT login.yahoo.com:443 HTTP/1.1
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.8) Gecko/20061109 CentOS/1.5.0.8-0.1.el4.centos4 Firefox/1.5.0.8 pango-text
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: Proxy-Connection: keep-alive
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: Host: login.yahoo.com
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: crumble crunched: Proxy-Connection: keep-alive!
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: addh-unique: Host: login.yahoo.com:443
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: Adding: Connection: close
Dec 09 22:56:06 Privoxy(b7f84bb0) Request: login.yahoo.com:443/
Dec 09 22:56:06 Privoxy(b7f84bb0) Writing: �Dec 09 22:56:09 Privoxy(b7f84bb0) Writing: HTTP/1.0 200 Connection established
Proxy-Agent: Privoxy/3.0.5
(...encrypted traffic follows.)
I am at a real loss for what to do next; any help would be greatly appreciated. So many sites have encrypted login pages that this impacts almost all of the sites that I want to automate.