In short, you can't get Perl to pretend to be a 'real' web browser, i.e. IE/Mozilla/NS/Opera. You can fake all the behaviour except for the JavaScript/DOM redirection part. You can fake some JavaScript support, but to do it all you need the DOM.
Here is a list of some of the issues you will need to deal with to get the 'real' pages.
- Use LWP::UserAgent to get the pages. In vanilla form it works for > 90% of pages.
- Add a random agent string so LWP pretends to be IE 5/5.5/6. The easiest way to get them is to grep your Apache access logs. There are also plenty of lists on the net.
- Add in support for meta-refresh redirects (there are about 6 different 'valid' syntaxes - where valid means that browsers accept them)
- Add in frames support (vital)
- Add in cookie support as this is often tested for.
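As a rough sketch of the LWP side of that list (not the code from our project): the agent strings, the 30 second timeout and the helper names make_ua and meta_refresh_url are all made up for illustration, and the regex only covers some of the meta-refresh syntaxes browsers accept.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Illustrative agent strings only; pull real ones from your access logs.
my @AGENTS = (
    'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);

# Build a user agent with a random IE-style agent string and cookie support.
sub make_ua {
    return LWP::UserAgent->new(
        agent      => $AGENTS[ rand @AGENTS ],
        cookie_jar => {},    # empty hashref gives a temporary in-memory jar
        timeout    => 30,    # assumed value, tune to taste
    );
}

# Return the absolute URL a meta-refresh tag points at, or undef if there
# is no such tag (or it only reloads the same page).
sub meta_refresh_url {
    my ( $html, $base ) = @_;
    while ( $html =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        next unless $attrs =~ /http-equiv\s*=\s*["']?refresh\b/i;

        # content may be double quoted, single quoted or bare
        next unless $attrs =~ /content\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;
        my $content = $1 // $2 // $3;

        # "<delay>;url=<target>", where the "url=" prefix and quotes are optional
        next unless $content =~ /^\s*\d+(?:\.\d+)?\s*[;,]\s*(?:url\s*=\s*)?['"]?([^'"\s]+)/i;
        return URI->new_abs( $1, $base )->as_string;
    }
    return;
}
```

Typical use would be to fetch with make_ua()->get($url), run meta_refresh_url over the decoded content with the response's base URL, and fetch again if it returns something.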
Once you have done all that, the only 'rejects/cloaking' you will get will involve JavaScript redirects. There are numerous variations: window.location = blah, window.location(blah), location.href = blah, location.replace(blah), etc, etc.
Some of these you can parse and follow. Some you can't, as they concat bits of the DOM into the redirect string.
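To give an idea of the 'parse and follow' case, here is a minimal sketch that only accepts a redirect whose target is a single quoted string literal, and gives up as soon as anything is concatenated onto it. The function name js_redirect_target is hypothetical and the pattern certainly does not cover every variation in the wild.

```perl
use strict;
use warnings;

# Try to pull a literal redirect target out of a chunk of javascript.
# Returns the URL string, or undef when nothing is found or the target
# is built from an expression (string concatenation, DOM lookups, ...).
sub js_redirect_target {
    my ($js) = @_;
    return unless $js =~ m{
        \b (?: (?:window|document|self|top|parent) \s* \. \s* )?
        location (?: \s* \. \s* (?:href|replace|assign) )?
        \s* (?: = (?!=) | \( ) \s*       # assignment or call form, not a comparison
        (["']) ( (?: (?!\1) . )* ) \1    # one quoted string literal
        (?! \s* \+ )                     # with nothing concatenated after it
    }xi;
    return $2;
}
```

Anything this returns still needs resolving against the page's base URL before you fetch it.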
When it comes to parsing the HTML, HTML::Parser will cough up the JavaScript either in the comments or in the text (depending on how it is wrapped), so it is suboptimal. If you are only interested in popups you are basically looking for window.open and a few other strings. You can parse these out reasonably reliably with regexes.
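For the popup case, something along these lines is usually enough. It is only a sketch: popup_urls is a made-up name, it runs over the raw HTML rather than HTML::Parser output, and it only sees window.open calls whose first argument is a plain quoted string.

```perl
use strict;
use warnings;

# Return the literal URLs passed as the first argument to window.open().
# Calls whose first argument is a variable or an expression are skipped.
sub popup_urls {
    my ($html) = @_;
    my @urls;
    while ( $html =~ /\bwindow\s*\.\s*open\s*\(\s*(["'])(.*?)\1/g ) {
        push @urls, $2 if length $2;    # ignore window.open('') placeholders
    }
    return @urls;
}
```

Because it ignores markup entirely, it picks up calls in onclick attributes, script blocks and comment-wrapped scripts alike.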
We implemented all of the above on a current project, but eventually ended up hacking IE so that it is a headless, windowless slave that goes and does our bidding. The nice part of that solution is that it really is IE doing the fetching, so no-one can tell it isn't IE. IE parses the HTML, sets up the DOM, runs the JavaScript, etc. We just gather up the HTML data from the parent and any child windows. You can hack Mozilla in a similar fashion.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print