Re^2: How to parse not closed HTML tags that don't have any attributes?

Perhaps a rare case where regex can indeed parse (broken) html?

Sorry for the direct answer. Of course it is IMHO, but in this case my opinion happens to be fairly strong :-)

today, most html is not handcrafted but machine-made via code.

... which means that browsers and other HTML parsers still have to deal with broken, hand-crafted HTML, even today. For example, HTML::Parser is now over 25 years old (it used to be known as HTML::Parse and was part of libwww-perl for a while), so it was written in the days when hand-crafted HTML was the norm, and it'll happily handle the broken HTML as well. Note that it's also the basis of many other modules, like HTML::TreeBuilder and WWW::Mechanize.

use warnings;
use strict;
use HTML::TreeBuilder::XPath;

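my $p = HTML::TreeBuilder::XPath->new;
# parse_content is parse() followed by eof(), which tells the tree builder the input is complete
$p->parse_content(<<'HTML');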
        <div class="phone">
          <div class="icon"></div>
          <p class="title">Telephone</p>
          <p>0123-4 56 78 90          
          <p class="title">Telefax</p>
          <p>        </div>
HTML
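# the <p> elements alternate title, value, title, value, so the flat list maps straight into a hash
my %hash = map { $_->as_trimmed_text }
    $p->findnodes('//*[@class="phone"]/p');
use Data::Dump; dd \%hash;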

__END__

{ Telefax => "", Telephone => "0123-4 56 78 90" }

Of course, there may still be exceptions that even parsers can't handle, for example something like "<p Text</p>". A browser won't display this correctly either, so someone making a typo like this when writing HTML by hand would hopefully notice. And even here I would suggest first fixing the string with a regex and then sending it through an HTML parser.
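To illustrate that last suggestion, here is a minimal sketch of "fix with a regex first, then parse". The helper name fix_unclosed_p and the sample input are made up for this example, and the regex rests on an assumption: a "<p " followed by text that reaches "</p>" without any ">" or "=" in between is a typo for "<p>". Real-world breakage may well need a different pattern.

```perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Hypothetical helper (not from any module): insert the missing ">" in
# typos like "<p Text</p>". Assumption: if no ">" or "=" appears between
# "<p " and the next "</p>", the opening tag was never closed.
sub fix_unclosed_p {
    my ($html) = @_;
    $html =~ s{<p\s+(?=[^<>=]*</p>)}{<p>}g;
    return $html;
}

# Made-up sample input: one broken <p>, one valid <p> with an attribute
my $fixed = fix_unclosed_p(
    '<div><p Text</p><p class="title">Telefax</p></div>');

# Only now hand the repaired string to a real HTML parser
my $tree  = HTML::TreeBuilder->new_from_content($fixed);
my @texts = map { $_->as_trimmed_text } $tree->look_down(_tag => 'p');
print join('|', @texts), "\n";   # prints: Text|Telefax
$tree->delete;
```

The lookahead stops at "=" and ">" so that valid tags such as <p class="title"> are left alone. The regex only patches the one known typo, and everything else is still left to the parser.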
