Re^2: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)

by aquarium (Curate)
on Nov 26, 2010 at 03:39 UTC


in reply to Re: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
in thread how to quickly parse 50000 html documents?

The example html is truly appalling: it uses outdated html attributes for presentation and provides no elements that describe document structure. It is exactly the conditions these sorts of html documents live in that make them prone to break frequently and unexpectedly, as the person who owns the document uses FrontPage (or worse, MS Word) to generate html content, with outdated attributes all over the place. Because there are no real structural html elements, just look/feel elements intermixed with the data, changes the owner makes are usually very naive in terms of valid/sane html. For example, at some stage you may start seeing several empty opening and closing font tags. The page still looks exactly the same as before, but the naive html generated by MS Word or an older FrontPage now breaks the scraping code, and you're back at square one trying to figure it out.
While scraping (sometimes very bad) html is inevitable, the way you go about it can make a real difference. Basing a scraping regex on the value of a particular attribute is especially fragile, e.g. don't look for "font size="1">". If you must anchor on the font tag, match just the tag name and the nearest closing bracket.
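A minimal sketch of that contrast in Perl, on an invented input line (the tag, label, and value here are hypothetical, not taken from the OP's pages):

    my $html = '<td><font size="1">Price: 42</font></td>';

    # Brittle: breaks the moment the generator emits size="2",
    # swaps attribute order, or drops a quote mark.
    my ($brittle) = $html =~ /font size="1">Price:\s*(\d+)/;

    # Sturdier: anchor on the tag name and the nearest closing
    # bracket only, ignoring whatever attributes are present.
    my ($robust) = $html =~ /<font[^>]*>\s*Price:\s*(\d+)/;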
I've worked with a Java-based federated-search engine for some time, and it was exactly where the vendor wrote scraping code to match frequently changing html (e.g. html attributes) that the scrapers kept breaking. So instead of moving on to bigger and better things, you end up maintaining a whole lot of scrapers that break all the time, and you're at the pointy end of the "fix it now".
In my opinion the example html is so bad as to be practically of no use, and you might as well use a module (or whatever) to strip the html altogether and base the scraping on the well-defined terms that are followed by a colon.
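As a rough sketch of that approach, using the CPAN module HTML::Strip (the term names are hypothetical placeholders for whatever well-defined terms the real pages use, and $raw_html is assumed to hold one page's source):

    use HTML::Strip;

    my $hs   = HTML::Strip->new();
    my $text = $hs->parse( $raw_html );   # tags gone; terms and colons remain
    $hs->eof;

    # Match on the stable "Term: value" text rather than on any tag.
    my %record;
    while ( $text =~ /(Name|Price|Date)\s*:\s*([^\n]+)/g ) {
        $record{$1} = $2;
    }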
Btw, when I called the html document owner "naive" I didn't mean to be nasty; it just means they don't know, or don't do, any better for whatever reason. It's naive in terms of using html with regard to the spec and current best practice.
the hardest line to type correctly is: stty erase ^H

Replies are listed 'Best First'.
Re^3: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
by BrowserUk (Patriarch) on Nov 26, 2010 at 04:34 UTC
    In my opinion the example html is so bad as to be practically of no use, and you might as well use a module (or whatever) to strip the html altogether and base the scraping on the well-defined terms that are followed by a colon.

    There is a simple maxim taught to me by my first boss in programming: don't do what won't benefit you.

    All we have to go on is that bad html snippet the OP posted. In all likelihood, all he has to go on is that html snippet grabbed from whatever website it came from. We could try to predict what might happen in the future and cater for it, but the highest probability is that whatever we guess will be wrong.

    The only sensible thing to do is work with what we know. And what we know, for now, is that the simple regex used works. If it changes in the future, then the 5 minutes it took to construct the program above may need to be spent again. If it then changes again, maybe there would be some pattern to the changes that would suggest a better approach. But it might never change; and any effort expended now to cater for unknown changes that might never happen would be entirely wasted.

    If these numbers were embedded in a plain text document, no one here would blink an eye at using a regex. But add a few <> into the mix and suddenly many start trotting out the cargo-cult wisdom: "Don't parse HTML/XML/XHTML/whatever with regex"; completely missing that most of the time nobody wants to parse the html; they just want to extract some small subset of text from a larger set of text. I.e. they want to do exactly what regexes are designed to do.
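    To illustrate with a minimal, invented line (the label and number are hypothetical): one match extracts the value, and nothing needed to be parsed.

        my $line = '<td><font size="1">Total:</font></td><td>1,234.56</td>';
        my ($total) = $line =~ /Total:.*?([\d,]+\.\d+)/;   # text within text; $total eq '1,234.56'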

    Basing a scraping regex on the value of a particular attribute is especially fragile, e.g. don't look for "font size="1">". If you must anchor on the font tag, match just the tag name and the nearest closing bracket.

    I'll take your word for the quality or lack thereof of the html, because I neither know nor care. It's just text within text to me.

    For now, what I've suggested to the OP works. And it works 500 times more quickly than his existing solution. If he gets to use it once before the source changes, he can afford to spend 3 working days re-writing it and still have gained. It took me less than 5 minutes to write this version and maybe 10 to test it, most of which was taken up generating 1000 test pages. If he gets to use it 10 times, he's saved himself enough time to take a month's vacation.

    It's simple. It works. Job done. And if it requires change next week, or next month, or next year, it is simple enough that it won't require deep knowledge of half a dozen co-dependent packages and APIs in order to fix it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
by ww (Archbishop) on Nov 26, 2010 at 04:49 UTC
    Appalling, you say?

    Well, the nested tables are awkward and the use of various outdated or deprecated tags is unfortunate; the lack of quotes and the like can certainly be labeled "mistakes." But "appalling" is a pretty strong word. Perhaps "dated" or similar would be better.

    ...so bad as to be practically of no use.

    Even harsher (and IMO, excessive), particularly since what we know about the html fails to support any inference that OP bears any responsibility.

    There is, however, a valuable nugget that saves your post from a quick downvote -- the notion that future changes could break a regex solution. OTOH, any solution we can readily offer today would also be broken were the html converted to 100% compliant xml.

      I take the criticism for using strong words.
      The html as provided is reminiscent of 80's websites: it provides no data-structure elements, merely look/feel html elements and attributes.
      Hence, as the tags are superficial in terms of the data, a scrape based on the html converted to plain text would be easier to write and less likely to break: base the regexes on the well-defined terms followed by a colon. Basing the regexes instead on the largely irrelevant (look/feel) html is, I think, not an effective design.
      I'm not criticising the html just for the sake of being critical, but I believe in basing programs on the best variant of the input data. So if one has no control over the website, then at least making a best attempt at getting non-breaking data is better (imho) than just scraping the worst and hoping for the best. So instead of merely criticising, it's about basing decisions on likely factors: deciding to use the text form of the data instead of anchoring on outdated or likely-not-well-formed html. That's all.
      the hardest line to type correctly is: stty erase ^H
