Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

TL;DR: Working code below!

Say you "just" want to extract some links. Are you sure the HTML's formatting will never change (whitespace, order of attributes, its structure, and so on)? Well, here's some perfectly valid HTML - good luck!

This is valid HTML5:

<a
href
=
"http://www.example.com/1"
>
One
</a
>
<a id="Two" title="href="></a>
<!--
<a href="http://www.example.com/3">Three</a>
-->
<a title=' href="http://www.example.com/4">Four'
href="http://www.example.com/5">Five</a>
<script>
console.log(' <a href="http://www.example.com/6">Six</a> '); /*
<!--
*/ </script>
<a href=http://www.example.com/7>Se<span
>v&#101;</span>n</a>
<script>/* --> */</script>

In addition, replace everything starting with the first <script> tag with this, and you've got valid XHTML - in other words, valid XML as well:

<script type="text/javascript">/*<![CDATA[
</script>
*/ console.log(' <a href="http://www.example.com/6">Six</a> '); /*
<!--
]]>*/</script>
<a href="http://www.example.com/7"><![CDATA[Se]]><span
>v&#101;</span>n</a>
<script type="text/javascript">/*<![CDATA[
-->
]]>*/</script>
<![CDATA[
<a href="http://www.example.com/8">Eight</a>
]]>

(There are only three links: "One", "Five", and "Seven".)
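
To make the problem concrete, here is a minimal sketch that runs a deliberately naive pattern against the HTML5 sample above. The regex is a hypothetical example of the kind of "simple" pattern one might reach for, not one taken from a real program. It reports links 3 and 6, which sit inside a comment and a JavaScript string, and misses all three real links:

```perl
use warnings;
use strict;

# The HTML5 sample from above, verbatim.
my $html = <<'END_HTML';
<a
href
=
"http://www.example.com/1"
>
One
</a
>
<a id="Two" title="href="></a>
<!--
<a href="http://www.example.com/3">Three</a>
-->
<a title=' href="http://www.example.com/4">Four'
href="http://www.example.com/5">Five</a>
<script>
console.log(' <a href="http://www.example.com/6">Six</a> '); /*
<!--
*/ </script>
<a href=http://www.example.com/7>Se<span
>v&#101;</span>n</a>
<script>/* --> */</script>
END_HTML

# Hypothetical "simple" regex: an <a> tag whose first attribute is a
# double-quoted href with no whitespace around the equals sign.
my @found = $html =~ /<a\s+href="([^"]*)"/g;

print "$_\n" for @found;
```

Loosening the pattern to cope with one of these cases tends to break another, which is the point of the sample.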

Solutions that work on all of the above:

Mojo::DOM (my personal favorite):

use Mojo::DOM;
my $links = Mojo::DOM->new($html)->find('a[href]');
for my $link (@$links) {
    ( my $txt_trim = $link->all_text ) =~ s/^\s+|\s+$//g;
    print $link->{href}, "\t", $txt_trim, "\n";
}

(Note you can use Mojo::Collection methods instead of the for loop if you like. And on Perl 5.14 and above, the code in the for loop can be simplified to: print $link->{href}, "\t", $link->all_text =~ s/^\s+|\s+$//gr, "\n";. Use Mojo::DOM->new->xml(1)->parse($xml) to use this module to parse XML, including XHTML.)
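
As a rough sketch of the Mojo::Collection variant mentioned in the note: `map` and `each` take the place of the explicit loop, and `trim` from Mojo::Util takes the place of the substitution. The `$html` string here is made up for illustration and much simpler than the samples above.

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw(trim);

# Made-up input for illustration; one link with an href, one without.
my $html = '<p><a href="http://www.example.com/1"> One </a> <a id="Two"></a></p>';

# find() returns a Mojo::Collection; map() builds one output row per link,
# and each() without a callback returns the rows as a plain list.
my @rows = Mojo::DOM->new($html)->find('a[href]')
    ->map(sub { join "\t", $_->{href}, trim($_->all_text) })
    ->each;

print "$_\n" for @rows;
```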

HTML::TreeBuilder::XPath (a bit older, but still works):

use HTML::TreeBuilder::XPath;
my $p = HTML::TreeBuilder::XPath->new;
$p->marked_sections(1);
$p->xml_mode(1); # DEPENDING ON INPUT
my @links = $p->parse($html)->findnodes('//a[@href]');
for my $link (@links) {
    print $link->attr('href'), "\t", $link->as_text_trimmed, "\n";
}

HTML::LinkExtor (a well-established module, based on HTML::Parser like the previous solution; only gets link attributes, no text content):

use HTML::LinkExtor;
my $p = HTML::LinkExtor->new;
$p->marked_sections(1);
$p->xml_mode(1); # DEPENDING ON INPUT
my @links = $p->parse($html)->links;
for my $link (@links) {
    my ($tag, %attrs) = @$link;
    print $attrs{href}, "\n";
}

(Note: for the previous two solutions, you might be tempted to do $p->xml_mode( $html=~/^\s*<\?xml/ );, but this isn't completely reliable: some XML documents don't have an XML declaration, and this regex is very simplistic. It's much more reliable if you know your inputs.)
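
A small sketch of why that check is fragile. The helper name `looks_like_xml` and both sample documents are made up for illustration. Both inputs are well-formed XML, but only the one that starts with `<?xml` is detected:

```perl
use warnings;
use strict;

# The simplistic check from the note above, wrapped in a hypothetical helper.
sub looks_like_xml {
    my ($text) = @_;
    return $text =~ /^\s*<\?xml/ ? 1 : 0;
}

# Two made-up documents; the leading <?xml ...?> line is optional in XML 1.0.
my $with_decl    = qq{<?xml version="1.0"?>\n<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>\n};
my $without_decl = qq{<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>\n};

for my $case ( [ 'with declaration' => $with_decl ], [ 'without declaration' => $without_decl ] ) {
    my ($label, $xml) = @$case;
    printf "%-20s => %s\n", $label, looks_like_xml($xml) ? 'xml_mode on' : 'xml_mode off';
}
```

The second document would be parsed in HTML mode even though it is XML.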

For even more potential solutions, see the thread Parsing HTML/XML with Regular Expressions. For example, XHTML can be parsed with XML::LibXML.
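
A minimal sketch of the XML::LibXML approach, assuming the XHTML declares the usual XHTML namespace on its root element. The small document and the `x` prefix are made up for illustration, and this is not necessarily how the Gist version does it. Because the elements are namespaced, the XPath expression needs a registered prefix:

```perl
use warnings;
use strict;
use XML::LibXML;

# Made-up XHTML wrapper around two of the tricky constructs from above.
my $xhtml = <<'END_XHTML';
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test</title></head>
<body>
<a href="http://www.example.com/7"><![CDATA[Se]]><span
>v&#101;</span>n</a>
<![CDATA[
<a href="http://www.example.com/8">Eight</a>
]]>
</body>
</html>
END_XHTML

my $doc = XML::LibXML->load_xml( string => $xhtml );

# Bind a prefix to the XHTML namespace so XPath can address the elements.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

# One [href, text] pair per real link; the CDATA "link" is just text.
my @rows = map { [ $_->getAttribute('href'), $_->textContent ] }
    $xpc->findnodes('//x:a[@href]');

print join( "\t", @$_ ), "\n" for @rows;
```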

All of the above code (and more!) is also available as a Gist: https://gist.github.com/haukex/fd76efa16f0b07ce6a7441d9b2265b2a

Update 2020-05-28: Edited title to reflect that the XHTML example is just as much about XML as HTML.


Replies are listed 'Best First'.
Re: Why a regex really isn't good enough for HTML, even for "simple" tasks by Corion (Patriarch) on May 08, 2020 at 07:30 UTC
Here's the solution using plain WWW::Mechanize. It fails the XHTML test, because (I think) it uses HTML::TokeParser, and somehow misparses the Six link:

#!/usr/bin/env perl
use warnings;
use strict;

my $file = shift or die;
print "##### WWW::Mechanize on $file #####\n";
my $html = do { open my $fh, '<', $file or die "$file: $!"; local $/; <$fh> };

use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->update_html($html);
my @links = $mech->links();
for my $link (@links) {
    print $link->url, "\t", $link->text, "\n";
}

Since HTML::TokeParser and HTML::Parser even live in the same distribution, I'll look at a pull request to change the parser type to the one that works.

Update: The pull request
Re^2: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 08, 2020 at 07:47 UTC
Thank you, I've added this one to the Gist too! (And in the meantime I added an XML::LibXML solution for the XHTML as well.)

"Since HTML::TokeParser and HTML::Parser even live in the same distribution, I'll look at a pull request to change the parser type to the one that works."

It looks to me like HTML::TokeParser isa HTML::Parser, so I think it's probably possible to set the options required to parse the XHTML (`marked_sections` and `xml_mode`), but it looks like WWW::Mechanize doesn't provide any way to set custom options on the parser.
Re^2: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 14, 2020 at 06:53 UTC
"Update: The pull request"

Thank you for this! A new version with this fix was just released, and I've updated the WWW::Mechanize solution accordingly. It now passes the tests!
Re: Why a regex really isn't good enough for HTML, even for "simple" tasks by hippo (Bishop) on May 05, 2020 at 14:04 UTC
Just for fun, here's a low-level solution using vanilla HTML::Parser.
Re^2: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 08, 2020 at 07:14 UTC
Thank you! I've added a slightly modified version to the Gist!
Re: Why a regex really isn't good enough for HTML, even for "simple" tasks by Corion (Patriarch) on May 08, 2020 at 09:30 UTC
WWW::Mechanize::Chrome curiously fails the XHTML test, which I've tentatively reported as a Chromium / DevTools bug. The HTML rendering and DOM inspector properly parse the HTML, but the DevTools return "Six" as a node, which isn't really true. The code also uncovered a bug/unexpected behaviour in how the link text gets constructed, so I'll upload a fixed version of WWW::Mechanize::Chrome soon.

#!/usr/bin/env perl
use warnings;
use strict;

my $file = shift or die;
print "##### WWW::Mechanize::Chrome on $file #####\n";
my $html = do { open my $fh, '<', $file or die "$file: $!"; local $/; <$fh> };

use Log::Log4perl ':easy';
use WWW::Mechanize::Chrome;
Log::Log4perl->easy_init($WARN);
my $mech = WWW::Mechanize::Chrome->new( headless => 1);
$mech->update_html($html);
my @links = $mech->links();
for my $link (grep { $_->url } @links) {
    print $link->url, "\t", $link->text, "\n";
}

Update: Actually, as the page itself contains "confusing" (to Chrome) information, this is somewhat explainable. The HTML is XML, but it later declares a `Content-Type` of `text/html`. Changing that to `Content-Type` `text/xhtml` makes (WWW::Mechanize::)Chrome report the correct links. I still wonder if this parser confusion between DevTools and JavaScript could be exploited somehow.
Re^2: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 08, 2020 at 18:09 UTC
"Actually, as the page itself contains "confusing" (to Chrome) information, this is somewhat explainable. The HTML is XML, but it later declares a `Content-Type` of `text/html`. Changing that to `Content-Type` `text/xhtml` makes (WWW::Mechanize::)Chrome report the correct links."

Interesting, thanks! According to several sources on the W3C website, the correct MIME type is `application/xhtml+xml`, so I've changed that.
Re: Why a regex really isn't good enough for HTML, even for "simple" tasks by tobyink (Canon) on May 07, 2020 at 23:36 UTC
Also, this is valid HTML5 (and also valid HTML 4.01, HTML 4.0, HTML 3.2, and HTML 2.0 too): `<a href=One.html>One</a>`

toby döt ink
Re^2: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 08, 2020 at 07:05 UTC
Updated, thank you!
Re: Why a regex really isn't good enough for HTML and XML, even for "simple" tasks by choroba (Cardinal) on Sep 25, 2020 at 09:32 UTC
This is an interesting regex semi-solution: Shallow XML parsing using XML::Parser::REX, which is regex-based.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re: Why a regex really isn't good enough for HTML and XML, even for "simple" tasks by bliako (Monsignor) on Jun 03, 2020 at 17:30 UTC
Here is another argument for your case: a regex is a graph. HTML::TreeBuilder/Mojo::DOM produce something very similar but much less complex: a (directed, acyclic) graph, i.e. the HTML tree, the DOM. Each HTML token/node in that tree is represented by separate regexes and can conveniently be considered a black box and put aside or switched off, as a separate sub() so to speak.

Somebody parsing with a single regex is actually smashing all the black boxes and building everything at the character level: both the identification of the HTML tokens and the HTML syntax tree. That's two different sets of rules put into one logic unit. What's more, the second set of rules distinguishes between tags, attributes, values, and content, so it's much higher-level than the first one, and it's much more difficult to retain the meaning of "tag" and re-use it. This is a task of huge complexity. Sooner or later, whoever follows the regex method will either rediscover HTML::TreeBuilder (directly or indirectly via regex embedded code) or die trying.

Then, once you have the DOM tree, you can query it as many times as you like, and quite efficiently too, because you are using the right tool: a tree data structure operating at the tag level. Whereas (correct me if I am wrong here) with a regex you must re-parse the same HTML content, at the character level, for each query. Plus, the TreeBuilder method can be easier to recycle, being higher level: it can be serialised, saved, reloaded, and passed as a function parameter by reference.

P.S. Something to visualise the herculean task of a regex engine: https://regexper.com/

bw, bliako
Re: Why a regex really isn't good enough for HTML, even for "simple" tasks by ikegami (Patriarch) on May 09, 2020 at 08:50 UTC
Your argument is utterly unconvincing. People use regex to extract from HTML documents because it works. They wouldn't use a regex to extract the URLs from the document you provided because it wouldn't work.

The real reason not to create a half-assed parser (using regex or otherwise) is this phrase we've all heard: "But it worked yesterday." This is what you'll get with a hacked-up solution, because it's going to be far less resilient to change and a lot more expensive to maintain than one using a proper parser. Also, there's a good chance you'll spend far more time developing the hacked-up solution as you keep finding corner cases.

Update: Replaced the claim that the presented task isn't a simple task with an explanation of why it isn't one. Sorry, this was done within seconds of posting.
Re^2: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 09, 2020 at 08:59 UTC
"Your argument is utterly unconvincing. No one would claim that parsing that HTML is a simple task."

Except that's not what I said, and people do try to use regexes to extract stuff from HTML all the time.

"The real reason not to create a half-assed parser (using regex or otherwise) is the following: "But it worked yesterday." A hacked up solution is going to be far less resilient to change and a lot more expensive to maintain than one using a proper parser."

Which is exactly the argument I made in Parsing HTML/XML with Regular Expressions.

Update: PerlMonks has a preview function; I won't be responding to your ninja edits. The above quotes represent the entirety of your post at my time of posting.
Re^3: Why a regex really isn't good enough for HTML, even for "simple" tasks by marto (Cardinal) on May 09, 2020 at 09:05 UTC
Also: '... Mark the changed/new content with the word "Update" ...' (from How do I post a question effectively?).
Re^3: Why a regex really isn't good enough for HTML, even for "simple" tasks by ikegami (Patriarch) on May 09, 2020 at 09:20 UTC
"people do try to use regexes to extract stuff from HTML all the time."

I know. And like I said, your argument isn't going to convince a single one of them to stop. They will see their tasks as simple tasks and yours as complex, and you completely failed to show why regex shouldn't be used for simple tasks despite your claims.

Perhaps you should add an explanation as to why they shouldn't be used for simple tasks?
Re^4: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 09, 2020 at 09:37 UTC
Re^4: Why a regex really isn't good enough for HTML, even for "simple" tasks by ikegami (Patriarch) on May 09, 2020 at 09:31 UTC
Re^5: Why a regex really isn't good enough for HTML, even for "simple" tasks (updated) by haukex (Archbishop) on May 09, 2020 at 09:40 UTC
Re^3: Why a regex really isn't good enough for HTML, even for "simple" tasks by ikegami (Patriarch) on May 09, 2020 at 09:03 UTC
"Except that's not what I said"

You said: "Why a regex really isn't good enough for HTML, even for "simple" tasks". So yeah, you did.

"Which is exactly the argument I made in Parsing HTML/XML with Regular Expressions."

OK, but it's what you said here I'm commenting on.
Re^4: Why a regex really isn't good enough for HTML, even for "simple" tasks by haukex (Archbishop) on May 09, 2020 at 09:36 UTC