Re: How to parse not closed HTML tags that don't have any attributes?

There is method in the madness: today, most html is not handcrafted but machine-made via code. In your case it is the generating code that looks broken (and not just the html you have shown, which is broken too). That can work to your advantage, as it may be broken in a consistent way.

Perhaps a rare case where regex can indeed parse (broken) html?

bw, bliako


Re^2: How to parse not closed HTML tags that don't have any attributes?
by haukex (Archbishop) on Mar 07, 2021 at 06:32 UTC

Perhaps a rare case where regex can indeed parse (broken) html?

No.

Sorry for the direct answer; of course it is IMHO, but in this case my opinion happens to be fairly strong :-)

today, most html is not handcrafted but machine-made via code.

... which means that browsers and other HTML parsers still have to deal with broken, hand-crafted HTML, even today. For example, HTML::Parser is now over 25 years old (it used to be known as HTML::Parse and was part of libwww-perl for a while), so it was written in the days when hand-crafted HTML was the norm, and it'll happily handle this broken HTML as well. Note that it's also the basis for many other modules, like HTML::TreeBuilder and WWW::Mechanize.

use warnings;
use strict;
use HTML::TreeBuilder::XPath;

my $p = HTML::TreeBuilder::XPath->new;
$p->parse(<<'HTML');
        <div class="phone">
          <div class="icon"></div>
          <p class="title">Telephone</p>
          <p>0123-4 56 78 90          
          <p class="title">Telefax</p>
          <p>        </div>
HTML
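$p->eof;  # parse() does not do this for you; it finalizes the tree
# The <p> elements alternate title, value, title, value, so the
# flattened list of their texts maps directly onto hash pairs.
my %hash = map { $_->as_trimmed_text }
    $p->findnodes('//*[@class="phone"]/p');
use Data::Dump; dd \%hash;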

__END__

{ Telefax => "", Telephone => "0123-4 56 78 90" }

Of course, there may still be exceptions that even parsers can't handle, for example something like "<p Text</p>". A browser won't display this correctly either, so someone making a typo like this when writing HTML by hand would hopefully notice. And even here I would suggest first fixing the string with a regex and then sending it through an HTML parser.
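As a rough sketch of that "fix with a regex, then parse" idea: the helper name fix_unclosed_p and the sample input below are made up for illustration, and the regex assumes the broken tag is a <p> without attributes whose ">" went missing. It only repairs the string; the actual parsing is still left to HTML::TreeBuilder.

```perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Hypothetical helper: repair "<p Text</p>" into "<p>Text</p>".
# Assumption: a "<p" followed by whitespace and then text that runs
# into the next "<" without any ">" in between is an opening tag
# that lost its ">". A real attribute list such as
# <p class="title"> hits a ">" first and is left alone.
sub fix_unclosed_p {
    my ($html) = @_;
    $html =~ s{<p\s+(?=[^<>]*<)}{<p>}gi;
    return $html;
}

my $fixed = fix_unclosed_p('<div><p Text</p></div>');
print "$fixed\n";    # <div><p>Text</p></div>

# Hand the repaired string to a real parser.
my $tree  = HTML::TreeBuilder->new_from_content($fixed);
my @texts = map { $_->as_trimmed_text } $tree->look_down( _tag => 'p' );
print "$_\n" for @texts;
$tree->delete;
```

The regex is deliberately narrow: it touches only that one kind of typo and does not try to understand the rest of the markup, which is the parser's job.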
