Have a look at HTML::Parser or HTML::TokeParser. If you just want to get the job done,
chances are good that what you want is unique enough to pull out with a regular expression. I don't
normally recommend regexes for this kind of job, but they do get it done. At any rate, you will
want to fetch the web page so you can parse it, and for that I recommend WWW::Mechanize. In
conclusion, parsing web pages is generic, but parsing a specific web page is not, so you probably will
not find an existing script for the website you are trying to parse, and if you do, chances are good
that it won't work for you. This is why you generally have to start from scratch and inspect the HTML
you are trying to parse with your own two eyes. And yes, as soon as the webmaster changes the HTML, your
script will probably break. :)
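A minimal sketch of the fetch-then-regex approach, assuming an invented URL and invented markup (substitute whatever the real page actually uses). The fetch itself is shown commented out so the example runs against a canned string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fetch step -- the URL is made up for illustration:
#
#   use WWW::Mechanize;
#   my $mech = WWW::Mechanize->new;
#   $mech->get('http://example.com/quotes.html');
#   my $html = $mech->content;
#
# For illustration, work on a canned chunk of HTML instead:
my $html = '<div class="quote">Hello, world</div>';

# Fragile by design: this breaks the moment the webmaster changes the markup.
if ( $html =~ m{<div class="quote">([^<]+)</div>} ) {
    print "Found: $1\n";
}
```

The capture pattern here is tied to one specific `div`; the point is that a one-off regex like this is often all a single-site scraper needs, at the cost of breaking when the markup changes.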
Thanks for the info.
Dean
Joost is right on this. I did something similar a while back, grabbing newspaper headlines, and LWP::Simple did the trick for me. Of course, at the time I didn't know about HTML::TokeParser, which would have made my job a whole lot easier.
You will want to save a copy of the source for a few days to make sure that the information you're looking for is in the same place every time. What you're going to want to look for is HTML comments; hopefully the page you're scraping has those around what you want. Then it's just a simple matter of reading until you get to the point you want to parse, parsing it, and you're done.
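The comment-landmark approach above can be sketched with HTML::TokeParser roughly like this. The markup and the `BEGIN headlines` / `END headlines` comments are invented for illustration; a real page will have its own landmarks:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Canned page standing in for a fetched one. The comments marking the
# section are hypothetical -- use whatever the real page puts there.
my $html = <<'HTML';
<html><body>
<!-- BEGIN headlines -->
<a href="/one">First story</a>
<a href="/two">Second story</a>
<!-- END headlines -->
</body></html>
HTML

my $p = HTML::TokeParser->new( \$html );

# Read tokens until we hit the comment that opens the section we want.
while ( my $token = $p->get_token ) {
    last if $token->[0] eq 'C' and $token->[1] =~ /BEGIN headlines/;
}

# Collect link text until the closing comment.
while ( my $token = $p->get_token ) {
    last if $token->[0] eq 'C' and $token->[1] =~ /END headlines/;
    if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
        print $p->get_trimmed_text('/a'), "\n";
    }
}
```

The token arrayrefs are `["S", $tag, ...]` for start tags and `["C", $text]` for comments, so skipping to a landmark comment is just a matter of looping until one matches.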
In addition, if you look here, this node contains a small program I wrote using HTML::TokeParser so you can see what you're going to get as output using that module. That may help you if you go that direction.
Hope that helps!
There is no emoticon for what I'm feeling now.