Reading multiple web sites

davies has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to work out a strategy, but I'm floundering because I don't know which modules I should be looking at. I want to compare information on the same subject from a number of web sites. Some of these are javascripted, so I was intending to try to automate a browser to get to the right pages. In the initial stages at least, I was expecting to open the browser with one tab per site, navigate to the page manually (I'd like to automate that in due course) and then extract the stuff that interests me from the pages. I will want to refresh the pages at various times as the data changes. I'm paranoid about JS, and was therefore planning to use Firefox on Linux, as the risk of damage from a malicious page is reduced. However, I'm not committed to that if there's a better solution available.

I am facing several problems that I don't know how to approach. First, it's not clear to me how to go about automating Firefox. The CPAN module MozRepl describes itself as "This module is perl interface of MozRepl", which leaves me unsure what MozRepl is or whether I need it. Also, it's at version 0.06, which makes me afraid I might be trying to use something that isn't production-ready yet. I'm also not clear how to deal with pages that are only reachable via JS. If I bookmark them, opening the bookmark takes me to the site's home page rather than the point I had reached.

Shopping sites seem to be able to get prices from multiple stores even when they use JS, so I believe that what I want can be done. However, I haven't found any useful documentation. If there are docs out there that cover what I want, I should be most grateful for any pointers, as well as any suggestions for a better approach.

Regards,

John Davies

Re: Reading multiple web sites by marto (Cardinal) on Jun 24, 2010 at 10:47 UTC
Have you looked at using WWW::Mechanize::Firefox to do the heavy lifting, or read "Using WWW::Selenium To Test Or Automate An Ajax Website"? I'm not sure whether these shopping sites scrape pages or use some specific API to access pricing data. Cheers
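For illustration only, here is a minimal sketch of the "one tab per site" idea using WWW::Mechanize::Firefox. It assumes Firefox is already running with the MozRepl extension enabled and that each page has been navigated to by hand. The store names, tab-title patterns and CSS selectors are hypothetical placeholders, and parse_price and read_tab are helper names invented for this sketch, not part of the module.

```perl
use strict;
use warnings;

# Turn a price string such as "$1,234.50" into a number; undef if none found.
sub parse_price {
    my ($text) = @_;
    return unless defined $text && $text =~ /(\d[\d,]*(?:\.\d+)?)/;
    (my $n = $1) =~ tr/,//d;    # drop thousands separators
    return $n + 0;
}

# Read one value from an already-open Firefox tab, chosen by its title.
# Assumption: Firefox is running with the MozRepl extension listening.
sub read_tab {
    my ($title_re, $css) = @_;
    require WWW::Mechanize::Firefox;    # loaded lazily; needs a live browser
    my $mech = WWW::Mechanize::Firefox->new( tab => $title_re );
    my ($node) = $mech->selector($css); # first element matching the selector
    return $node ? parse_price( $node->{innerHTML} ) : undef;
}

# Hypothetical sites: tab-title patterns and selectors are placeholders.
my %sites = (
    'Store A' => [ qr/Store A/, '.price' ],
    'Store B' => [ qr/Store B/, '#cost'  ],
);

# Only touch the browser when explicitly asked to with --run.
if ( @ARGV && $ARGV[0] eq '--run' ) {
    for my $name ( sort keys %sites ) {
        my $price = read_tab( @{ $sites{$name} } );
        printf "%-10s %s\n", $name, defined $price ? $price : 'n/a';
    }
}
```

Attaching to existing tabs by title sidesteps the bookmark problem for now, since the JS-driven navigation has already been done manually. The module also offers $mech->reload for refreshing a page, though on a JS-heavy site a reload may well drop you back at the start page, so that would need testing per site.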
Re: Reading multiple web sites by Herkum (Parson) on Jun 24, 2010 at 14:16 UTC
You may or may not realize it, but there is a good chance that those "stores" already have an API for sharing pricing data. A good example of this would be eBay. I posted some things for sale there, and when I searched for similar items I found several web sites that were displaying my item for sale! Chances are that someone is just selling a kit of some sort that gets the information for them. That being said, it can be difficult to write your own even using their API. eBay has modules on CPAN for doing this stuff; I looked at them and they are written completely in eBay techno-babble. (God forbid they write something simple and comprehensible to the average Joe.) Basically, I am suggesting that you might be better off trying those published APIs instead of rolling your own. If you scrape, you can very easily run into a lack of conformity in the information you want to get, as well as invalid HTML. And that is not even touching the JS issues.

