Re: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

by bliako (Prior)
on Jun 03, 2020 at 17:30 UTC

in reply to Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Here is another argument for your case:

A regex is, once compiled, a graph of states walked one character at a time. HTML::TreeBuilder and Mojo::DOM produce something similar in kind but far less complex to work with: a directed, acyclic graph, i.e. the HTML tree, the DOM. Each HTML token/node in that tree is, in effect, recognised by its own small regex, and can conveniently be treated as a black box and put aside or switched off, like a separate sub() so to speak. Somebody parsing with a single regex is smashing all those black boxes open and building everything at the character level: both the identification of the HTML tokens and the HTML syntax tree. That is two different sets of rules crammed into one logical unit. What's more, the second set of rules distinguishes between tags, attributes, values and content, so it operates at a much higher level than the first. Inside one big regex it is much more difficult to retain the meaning of "tag" and reuse it. This is a task of huge complexity. Sooner or later, whoever follows the regex method will either rediscover HTML::TreeBuilder (directly, or indirectly via code embedded in the regex) or die trying.

Then, once you have the DOM tree, you can query it as many times as you like, and quite efficiently too, because you are using the right tool: a tree data structure operating at the tag level. With a regex, on the other hand (correct me if I am wrong here), you must re-parse the same HTML content, at the character level, for each query.
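
To make the parse-once, query-many point concrete, here is a minimal sketch using Mojo::DOM. The HTML snippet and the variable names are made up for illustration; the point is only that the document is parsed a single time and the same tree then answers several unrelated questions.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up sample document, purely for illustration.
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body>
  <p class="intro">Hello <a href="/one">first</a></p>
  <p>Bye <a href="/two">second</a></p>
</body></html>
HTML

# Parse once: from here on we work at the tag level.
my $dom = Mojo::DOM->new($html);

# Query 1: the page title.
my $title = $dom->at('title')->text;

# Query 2: all link targets, in document order.
my @hrefs = $dom->find('a[href]')->map(attr => 'href')->each;

# Query 3: the full text of every paragraph with class "intro".
my @intro = $dom->find('p.intro')->map('all_text')->each;

print "title: $title\n";
print "links: @hrefs\n";
print "intro: @intro\n";
```

Each of the three queries walks the already-built tree; none of them looks at the raw characters of `$html` again.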

Plus, the TreeBuilder method can be easier to reuse, being higher level. The tree can be serialised, saved, reloaded, or passed to a function by reference.
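
A small sketch of that reuse, shown here with Mojo::DOM rather than HTML::TreeBuilder simply because it keeps the example short. The helper name `strip_scripts` and the sample markup are hypothetical; "serialising" in this sketch just means turning the tree back into an HTML string that could be written to a file and parsed again later.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: receives the tree (an object reference, so nothing
# is copied) and removes every <script> element from it in place.
sub strip_scripts {
    my ($dom) = @_;
    $dom->find('script')->each(sub { $_->remove });
    return $dom;
}

my $dom = Mojo::DOM->new('<div><script>x()</script><p>kept</p></div>');
strip_scripts($dom);

# Serialise the modified tree back to HTML, then reload it into a new tree.
my $saved    = $dom->to_string;
my $reloaded = Mojo::DOM->new($saved);

print "$saved\n";
print $reloaded->at('p')->text, "\n";
```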

p.s. something to visualise the herculean task of a regex-engine:

bw, bliako

