Re: Scraping HTML: orthodoxy and reality

by gjb (Vicar)
on Jul 08, 2003 at 13:37 UTC (#272277)

in reply to Scraping HTML: orthodoxy and reality

Although I agree that the choice between a regexp approach and a context-free grammar approach depends on the problem at hand, I'd like to stress that halley made a very important point:

Rules are meant to be broken, but you've to understand them before you can break them... safely.

Although many Monks will know the distinction between a regular language and a context-free language (and I'm sure grinder and BrowserUK do), I'm fairly sure that some don't. Unfortunately, those who don't simply don't know the rules, and so have plenty of opportunity to mess up.
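For those who haven't met the distinction before, here is a minimal sketch of where it bites. It is written in Python rather than Perl, and the function names and the sample markup are made up for illustration. A plain regular expression has no way to count nesting depth, so it stops at the first closing tag; a matcher that keeps a depth counter (the simplest step towards a context-free recognizer) finds the right one.

```python
import re

# Regular-expression attempt: non-greedy match of one <div>...</div> pair.
DIV_RE = re.compile(r"<div>(.*?)</div>", re.S)
# Tokenizer for the counting approach: finds every opening or closing div tag.
TAG_RE = re.compile(r"<div>|</div>")


def first_div_regex(html):
    """Return the body of the first <div> as the naive regex sees it."""
    m = DIV_RE.search(html)
    return m.group(1) if m else None


def first_div_balanced(html):
    """Return the body of the first <div>, tracking nesting depth with a counter."""
    depth = 0
    start = None
    for m in TAG_RE.finditer(html):
        if m.group() == "<div>":
            if depth == 0:
                start = m.end()  # body begins right after the outermost open tag
            depth += 1
        else:
            if depth == 0:
                continue  # stray closing tag before any opening tag; ignore it
            depth -= 1
            if depth == 0:
                return html[start:m.start()]
    return None  # no complete, balanced <div> found


sample = "<div>a<div>b</div>c</div>"
print(first_div_regex(sample))     # stops at the inner closing tag
print(first_div_balanced(sample))  # stops at the matching closing tag
```

Whether the naive version is good enough depends on the page you are scraping, which is exactly why it helps to know which rule you are bending.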

I'd like to paraphrase: "a little thinking is a dangerous thing" if the process is not supported by a proper amount of background knowledge.

It is possible to approximate a context-free grammar with a regular expression; Mark-Jan Nederhof has written a nice survey article on the subject. There are several good books about formal languages, but I'd particularly recommend Sipser's, since it is well written and pleasant to read.
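To give a flavour of what such an approximation can look like, here is a small sketch, again in Python. It uses one very simple idea, bounding the nesting depth, and I am not claiming it is one of the methods covered in the survey. The function name `bounded_paren_regex` and the choice of balanced parentheses as the example language are my own assumptions. The resulting regular expression accepts only a subset of the context-free language: everything it accepts is balanced, but anything nested deeper than the bound is rejected.

```python
import re


def bounded_paren_regex(max_depth):
    """Build a regex for balanced parentheses nested at most max_depth deep."""
    pattern = r"[^()]*"  # depth 0: text containing no parentheses at all
    for _ in range(max_depth):
        # One more level: plain text, then any number of parenthesised
        # groups whose contents are balanced at the previous depth.
        pattern = r"[^()]*(?:\(" + pattern + r"\)[^()]*)*"
    return re.compile(pattern)


depth2 = bounded_paren_regex(2)
for text in ["a(b(c)d)e", "((()))", "(()"]:
    verdict = "accepted" if depth2.fullmatch(text) else "rejected"
    print(f"{text!r}: {verdict}")
```

The pattern grows with every extra level you allow, which is a fair hint that you are working against the grain of the formalism.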

Conclusion: if you know the rules but don't understand them, don't try to break them. More importantly: try to understand the rules you're following.

Just my 2 cents, -gjb-
