Re: Re: Scraping HTML: orthodoxy and reality

in reply to Re: Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Ironically the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.

Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well until Perl 6 that is...) Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text is going to give much more robust solutions. For instance you avoid different kinds of data being mistaken for each other.

The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing probem with regular expressions. Then fail (and may or may not realize this). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be, rather than regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels - if only to avoid the problem of all problems looking like nails for the one hammer (regexps) that you have.

However the two kinds of problems are different and do overlap. Where they do overlap, it isn't necessarily obvious which is more practical. It isn't even necessarily obvious from the problem specification - sometimes you need to make a guess about how the code will evolve to know that...

In Section Meditations