Most efficient way to parse web pages

by cbraga (Pilgrim)
on Jun 19, 2000 at 05:09 UTC

cbraga has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a Perl script that takes HTML files, up to 50 kB in size, from which I need to extract some information that is contained in just a few lines. I wonder what the most efficient way to find that information would be.

A number of approaches come to mind, such as using regexps, or grep followed by regexps, but it is quite important to minimize CPU usage since this will be part of a web spider, and I thought some of the monks might have better ideas. Thanks.
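For concreteness, here is a minimal sketch of the slurp-and-match baseline the question alludes to; the file name and the <title> target are hypothetical stand-ins for whatever information actually needs extracting:

    #!/usr/bin/perl -w
    use strict;

    # Hypothetical example: slurp one fetched page and pull out its <title>.
    my $file = 'page.html';
    open my $fh, '<', $file or die "open $file: $!";
    my $html = do { local $/; <$fh> };   # read the whole file in one go
    close $fh;

    # One anchored match is cheap even on a 50 kB page.
    if ($html =~ m{<title>\s*(.*?)\s*</title>}is) {
        print "title: $1\n";
    }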

Re: Most efficient way to parse web pages
by merlyn (Sage) on Jun 19, 2000 at 05:20 UTC
Re: Most efficient way to parse web pages
by eduardo (Curate) on Jun 19, 2000 at 07:17 UTC
    At work we've written a distributed web spider... basically it's a forking model that then gets thrown around on a Mosix cluster... but anyway, I digress. What we've done is use the Parse::RecDescent module from CPAN and build up a grammar for parsing web pages. We then describe a website in that metalanguage, and it generates an automaton that goes out, grabs the web page, and pulls out the important parts. Very flexible, very powerful, and we can parse millions of pages a day with it.
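    Since neither the grammar nor the site metalanguage is shown, here is a minimal sketch of the general Parse::RecDescent pattern described above: define a grammar, compile it, and run a start rule over the page text. The link-extraction target and the rule names are illustrative assumptions, not the poster's actual grammar:

        #!/usr/bin/perl -w
        use strict;
        use Parse::RecDescent;

        # Illustrative grammar: pull anchor hrefs out of a page and skip
        # everything else. A real spider's grammar would describe the
        # structure of the specific site instead.
        my $grammar = q{
            page   : chunk(s) { $return = [ grep { length } @{$item[1]} ] }
            chunk  : anchor | junk
            anchor : /<a[^>]*href\s*=\s*"/i href '"'  { $return = $item{href} }
            href   : /[^"]+/
            junk   : /[^<]+|</                        { $return = '' }
        };

        my $parser = Parse::RecDescent->new($grammar) or die "bad grammar\n";

        my $html  = do { local $/; <> };   # slurp the page from a file or STDIN
        my $links = $parser->page($html) or die "no parse\n";
        print "$_\n" for @$links;

    Note that junk returns an empty string rather than undef, because a Parse::RecDescent action that returns undef makes its production fail.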
Re: Most efficient way to parse web pages
by Vane (Novice) on Jun 19, 2000 at 08:10 UTC
    If you know exactly what you're looking for, and the pages come at you in random order (all pages equally important, not a tree; uncontrolled HTML you didn't write), *nothing* beats a forking LWP get, slurp, match, except multiple machines doing the same thing in parallel (fork, get, slurp, match). A full parser has to chew through all sorts of fat, sloppy, mixed-content, rarely correct HTML written by fools with Word, FrontPage, or Dreamweaver so that it merely looks good.

    I take it you're looking for something specific. I've never looked at the Parse::RecDescent module, but it has to do something less involved than HTML::Parser; I'll look into it, thanks. You'll need to write your own matching anyway. So: "keep it short and simple", "spread the work", "tune the fork(s)".

    Tips: slurp with $/ keyed on what you're looking for if it's likely to be near the beginning of the page, or set $/ to undef (and use m//g) if it isn't, or if there are several potentially random occurrences. (m//g and pos) is really quick, and it nests. Because multiple forks and multiple machines are asynchronous by nature, they make a mighty engine over TCP/IP by spreading the work across that space. See the sketch below.

    Vane
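    The reply above describes the technique without code, so here is a hedged sketch of one fork-per-URL get, slurp, match worker using LWP::Simple. The URL list and the href pattern are placeholders:

        #!/usr/bin/perl -w
        use strict;
        use LWP::Simple qw(get);

        # Placeholder work list; a real spider would pull from a queue.
        my @urls = qw(
            http://example.com/a.html
            http://example.com/b.html
        );

        for my $url (@urls) {
            defined(my $pid = fork) or die "fork: $!";
            next if $pid;             # parent keeps spawning children

            # Child: get, slurp, match, exit. Each child is independent,
            # so a slow host stalls only itself.
            my $html = get($url);     # returns undef on failure
            exit unless defined $html;

            # m//g walks every match without restarting the scan.
            while ($html =~ m{<a[^>]*href\s*=\s*"([^"]+)"}ig) {
                print "$url -> $1\n";
            }
            exit;
        }

        1 while wait != -1;           # reap the children

    In real use you would cap the number of concurrent children and tune that cap per machine, which is what "tune the fork(s)" comes down to.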
