
Easiest Way To Cut Info from Webpages

by kingdean (Novice)
on Jun 17, 2004 at 18:56 UTC ( #367725=perlquestion )

kingdean has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a way, or a Perl script, that will cut info from a webpage. Let's assume that the page doesn't change format. For example, if a site has a deal of the day and the only things on that page that change are the picture, price, and item name, then there should be a way to have a script call that page, look for the start and end of the section, and pull the relevant info off of that page onto my test page. My test page would say "Today's Deal from KBTOYS is" and then the item name would show up; then I could say "The price is" and the item price would show up, etc. Does anyone know of any Perl code that can be modified to do such a thing, where I can put in any URL, along with the start and end points, and get the info on the fly? Or if not on the fly, maybe something that could be scheduled to run a few times a day? Does anyone know how to write a script like that? Thanks, Dean

Replies are listed 'Best First'.
Re: Easiest Way To Cut Info from Webpages
by jeffa (Bishop) on Jun 17, 2004 at 19:10 UTC

    Have a look at HTML::Parser or HTML::TokeParser. If you just want to get the job done, chances are good that what you want is unique enough to pull out with a regular expression. I don't normally recommend regexes for this kind of job, but they do get it done. At any rate, you will want to fetch the web page so you can parse it, and for that I recommend WWW::Mechanize. In conclusion: parsing web pages in general is a generic problem, but parsing a specific web page is not, so you probably will not find an existing script for the website you are trying to parse, and if you do, chances are good it won't work for you. This is why you generally have to start from scratch and inspect the HTML you are trying to parse with your own two eyes. And yes, as soon as the WebMaster changes the HTML, your script will probably break. :)


    (the triplet paradiddle with high-hat)
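    A minimal sketch of the fetch-then-regex approach jeffa describes. The fetch step (WWW::Mechanize) is shown in a comment so the extraction can run against a canned snippet; the class names and markup below are made up for illustration, so the patterns would need adjusting to the real page's HTML:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In practice you would fetch the live page first, e.g. with WWW::Mechanize:
#   my $mech = WWW::Mechanize->new;
#   $mech->get('http://www.kbtoys.example/dealoftheday');  # hypothetical URL
#   my $html = $mech->content;
# A canned snippet stands in for the fetched page here:
my $html = <<'HTML';
<div class="deal">
  <img src="/images/robot.jpg">
  <span class="name">Toy Robot</span>
  <span class="price">$9.99</span>
</div>
HTML

# These class names are assumptions; inspect the real page source and
# adjust the patterns to whatever actually surrounds the name and price.
my ($name)  = $html =~ m{<span class="name">\s*(.*?)\s*</span>}s;
my ($price) = $html =~ m{<span class="price">\s*(.*?)\s*</span>}s;

print qq{Today's Deal is "$name". The price is $price.\n};
```

    As jeffa warns, this breaks the moment the markup changes, but for a single stable page it is often all you need.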
Re: Easiest Way To Cut Info from Webpages
by Joost (Canon) on Jun 17, 2004 at 19:12 UTC
Re: Easiest Way To Cut Info from Webpages
by calin (Deacon) on Jun 17, 2004 at 19:09 UTC
    I am looking for a way, or perl script, that will cut info from a webpage.

    The "technical" term for this is "screen scraping".

      Thanks for the info. Dean
Re: Easiest Way To Cut Info from Webpages
by Aragorn (Curate) on Jun 17, 2004 at 19:26 UTC
Re: Easiest Way To Cut Info from Webpages
by Popcorn Dave (Abbot) on Jun 18, 2004 at 06:09 UTC
    Joost is right on this. I did something similar a while back, grabbing newspaper headlines, and LWP::Simple did the trick for me. Of course, at the time I didn't know about HTML::TokeParser, which would have made my job a whole lot easier.

    You will want to save a copy of the source for a few days to make sure that the information you're looking for is in the same place every time. What you want to look for is HTML comments. Hopefully the page you're scraping has those around what you want. Then it's just a simple matter of reading until you get to the point you want to parse, parsing it, and you're done.
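    If the page does carry such comments, the slice between them can be grabbed in one go. A small sketch, assuming hypothetical `<!-- DEAL START -->` / `<!-- DEAL END -->` markers (the real marker text will be whatever the page's author actually used):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Canned page; in real use this would come from LWP::Simple's get($url).
# The marker names below are assumptions -- check the page source.
my $html = <<'HTML';
<html><body>
<!-- DEAL START -->
<b>Toy Robot</b> for only $9.99
<!-- DEAL END -->
</body></html>
HTML

# Grab everything between the two comment markers.
my ($deal) = $html =~ m{<!--\s*DEAL START\s*-->(.*?)<!--\s*DEAL END\s*-->}s;
die "markers not found\n" unless defined $deal;

print "Raw deal section: $deal\n";
```

    Once you have the raw section, a second pass (regex or a parser) can pick out the name and price.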

    In addition, if you look here, this node contains a small program I wrote using HTML::TokeParser so you can see what you're going to get as output using that module. That may help you if you go that direction.

    Hope that helps!

    There is no emoticon for what I'm feeling now.
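    For a sense of what HTML::TokeParser (mentioned in several replies above) gives you, here is a tiny sketch that walks to a tag and pulls out its text. The sample HTML is invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution on CPAN

my $html = <<'HTML';
<html><body><h1>Deal of the Day</h1><p>Toy Robot for $9.99</p></body></html>
HTML

# Parse from a string reference; a filename or filehandle also works.
my $p = HTML::TokeParser->new(\$html);

# Skip ahead to the first <p> tag, then collect the text up to </p>.
$p->get_tag('p');
my $text = $p->get_trimmed_text('/p');

print "$text\n";
```

    This is more robust than a regex when the surrounding markup shifts a little, since you are navigating tags rather than matching exact strings.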

Node Type: perlquestion [id://367725]
Approved by Corion