Re: Scraping HTML: orthodoxy and reality

in reply to Scraping HTML: orthodoxy and reality

The distinctions between "parse" and "extract", or between "regular language" and "context-free", are indeed important, as some monks have pointed out. Parsing data is a (more or less) mechanical process; extracting information is a human (or A.I.) process.

Suppose you want to extract info by paragraph. Consider the following text fragment:

________________________________________

Look at the table below...
ho ho ho...

Could you behold the secret this unfolds?

A bit more, a bit more, irrelevant thought, a new paragraph...

________________________________________

You might see two paragraphs (if you consider "Look... unfolds?" as one paragraph) or three (if you don't). Now, let's look at the HTML of the above text fragment:

<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>

A bit more, a bit more, irrelevant thought, a new paragraph...</p>

A parser might see only one paragraph: the one between the outer <p> and </p> tags. There is also a <p>...</p> pair inside the table. Is that a paragraph too? A parser might ask.
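To make the point concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about Perl) using the standard-library html.parser module. The class name ParagraphCounter and the FRAGMENT constant are my own, not from any library. Note that html.parser is just a tokenizer: it reports tags as written and applies none of the tree-building rules a browser would.

```python
from html.parser import HTMLParser

# The fragment from the post, copied as-is.
FRAGMENT = """<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>

A bit more, a bit more, irrelevant thought, a new paragraph...</p>"""


class ParagraphCounter(HTMLParser):
    """Counts <p> start tags, separating outermost from nested ones."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <p> elements are currently open
        self.top_level = 0   # <p> tags opened while no other <p> is open
        self.nested = 0      # <p> tags opened inside another <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.depth == 0:
                self.top_level += 1
            else:
                self.nested += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1


counter = ParagraphCounter()
counter.feed(FRAGMENT)
counter.close()
print(counter.top_level, counter.nested)  # prints: 1 1
```

Purely syntactically, the answer is "one paragraph, or two if the one inside the table counts". Neither matches what a reader sees on the page.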

Suppose the parser takes into consideration that some people use <br><br> to denote the end of a paragraph. "Look..." and "Could..." might then be considered two paragraphs. What about "A bit..."? Or are "Look..." and the table two paragraphs?
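The heuristic can be sketched the same way. Again this is only an illustration in Python with the standard-library html.parser; BreakSplitter, split_paragraphs and the table_breaks switch are hypothetical names of mine. The switch stands for exactly the judgment call in question: does the table start a new paragraph or not?

```python
from html.parser import HTMLParser

FRAGMENT = """<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>

A bit more, a bit more, irrelevant thought, a new paragraph...</p>"""


class BreakSplitter(HTMLParser):
    """Splits text into chunks at <br><br>, optionally at table boundaries."""

    def __init__(self, table_breaks=False):
        super().__init__()
        self.table_breaks = table_breaks
        self.chunks = [""]
        self.pending_br = 0  # consecutive <br> tags seen so far

    def _new_chunk(self):
        # Start a new chunk only if the current one holds some text.
        if self.chunks[-1].strip():
            self.chunks.append("")

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.pending_br += 1
            if self.pending_br == 2:
                self._new_chunk()
            return
        self.pending_br = 0
        if tag == "table" and self.table_breaks:
            self._new_chunk()

    def handle_endtag(self, tag):
        if tag == "br":
            return  # <br/> also triggers an end-tag event; ignore it
        self.pending_br = 0
        if tag == "table" and self.table_breaks:
            self._new_chunk()

    def handle_data(self, data):
        if data.strip():
            self.pending_br = 0
            self.chunks[-1] += data


def split_paragraphs(html, table_breaks=False):
    parser = BreakSplitter(table_breaks)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty chunks.
    return [" ".join(c.split()) for c in parser.chunks if c.strip()]


print(len(split_paragraphs(FRAGMENT)))                     # prints: 3
print(len(split_paragraphs(FRAGMENT, table_breaks=True)))  # prints: 4
```

Three "paragraphs" with one setting, four with the other, and in the first case "ho ho ho..." is glued onto "Look at the table below...". The code cannot tell you which setting is right; only a reader who understands the text can.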

Humans read semantically; machines read mostly syntactically. That's why extracting information is not the same problem as parsing data.
