LWP simple question

by InterGuru (Sexton)
on May 24, 2007 at 03:38 UTC

InterGuru has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to download an Amazon page for screen-scraping. The page I get with LWP::Simple is different from the page I get by pasting the URL (http://www.amazon.com/exec/obidos/ASIN/0394756673/ref=nosim/bookreadersre-20) into a browser. The page from the browser has the string "offer-listing" in the HTML source; the page from LWP does not.

I have tried being logged in and logged out of Amazon to see whether that makes a difference. It does not.

Here is the code.

#!/usr/bin/perl
use strict;
use LWP::Simple;

print "test_get.pl\n";

my $amazon_url = q{http://www.amazon.com/exec/obidos/ASIN/0394756673/ref=nosim/bookreadersre-20};
my $page = get($amazon_url);

open FILE, '>temp2' or die "Cannot open temp2\n";
print FILE $page;

my $sought_string = q{offer-listing};
if ($page =~ /$sought_string/) {
    print "Found it\n";
}
else {
    print "No luck\n";
}
The result of running the code is "No luck".

Update

imp's reply works. I already use Net::Amazon, but its API does not contain the information I need.

Replies are listed 'Best First'.
Re: LWP simple question
by imp (Priest) on May 24, 2007 at 04:04 UTC
    They are probably just checking the user agent. This worked for me:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $amazon_url = q{http://www.amazon.com/exec/obidos/ASIN/0394756673/ref=nosim/bookreadersre-20};

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3');

    my $response = $ua->get($amazon_url);
    my $page = $response->content;

    my $sought_string = q{offer-listing};
    if ($page =~ /$sought_string/) {
        print "Found it\n";
    }
    else {
        print "No luck\n";
    }
Re: LWP simple question
by Fletch (Bishop) on May 24, 2007 at 12:12 UTC

    You might also consider looking at Amazon's web services offerings, which would probably give you access to the same content without having to resort to scraping HTML.
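
    A minimal sketch of what such a lookup might look like with Net::Amazon, closely following that module's documented synopsis (the token is a placeholder for your own AWS access key id):

    use strict;
    use warnings;
    use Net::Amazon;

    # 'YOUR_AMZN_TOKEN' is a placeholder -- supply your own AWS access key id
    my $amazon = Net::Amazon->new( token => 'YOUR_AMZN_TOKEN' );

    # Look up the same book by its ASIN
    my $response = $amazon->search( asin => '0394756673' );

    if ($response->is_success) {
        print $response->as_string, "\n";
    }
    else {
        print "Error: ", $response->message, "\n";
    }

    Whether the details you are after (such as individual offer listings) show up in the response depends on what the web service exposes, which is the limitation mentioned in the update above.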

Re: LWP simple question
by tomfahle (Priest) on May 24, 2007 at 15:57 UTC
Re: LWP simple question
by gferguson (Acolyte) on May 25, 2007 at 15:53 UTC
    I agree the Amazon API is probably your best bet.

    For your consideration (and I'm not saying Amazon is doing this): I've run into sites that will redirect a request if they can't set a cookie or session id and get it back, or if they see that your referer isn't from inside their domain while you're requesting a secondary page instead of the main page.

    My workaround was to use WWW::Mechanize, because it maintains state (cookies and the like) across the pages it retrieves. Like LWP::Simple, it is built on top of LWP::UserAgent, only WWW::Mechanize is smarter, or at least more full-featured. It's easy to use too, and probably should be a part of your toolkit.
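
    A rough sketch of that approach (the agent_alias string and the output filename are just illustrative choices):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $amazon_url = q{http://www.amazon.com/exec/obidos/ASIN/0394756673/ref=nosim/bookreadersre-20};

    # Mechanize keeps cookies across requests and sends a Referer header
    # when following links, which plain LWP::Simple does not.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->agent_alias('Windows Mozilla');    # present a browser-like user agent

    $mech->get($amazon_url);
    $mech->save_content('temp2');             # keep a copy of the fetched page

    if ($mech->content =~ /offer-listing/) {
        print "Found it\n";
    }
    else {
        print "No luck\n";
    }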

      Mechanize knows links. Mechanize knows images. Mechanize's save_content() method creates files on the fly. There's really no reason NOT to use it.
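
      For the "knows links" part, pulling matching links back out of a fetched page might look something like this (the url_regex pattern is illustrative, not a guarantee about Amazon's markup):

      # Assuming $mech already holds a fetched page, as in the sketch above
      for my $link ( $mech->find_all_links( url_regex => qr/offer-listing/ ) ) {
          print $link->url, "\n";
      }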

      xoxo,
      Andy
