PerlMonks
in reply to Re: How to parse not closed HTML tags that don't have any attributes?
in thread How to parse not closed HTML tags that don't have any attributes?
Perhaps a rare case where regex can indeed parse (broken) html?
Sorry for the direct answer. Of course it is IMHO, but in this case my opinion happens to be fairly strong :-)
today, most html is not handcrafted but machine-made via code.
... which means that browsers and other HTML parsers still have to deal with broken, hand-crafted HTML, even today. For example, HTML::Parser is now over 25 years old (it used to be known as HTML::Parse and was part of libwww-perl for a while), so it was written in the days when hand-crafted HTML was the norm, and it'll happily handle broken HTML as well. Note that it's also the basis for many other modules, like HTML::TreeBuilder and WWW::Mechanize.
```perl
use warnings;
use strict;
use HTML::TreeBuilder::XPath;

my $p = HTML::TreeBuilder::XPath->new;
$p->parse(<<'HTML');
<div class="phone">
<div class="icon"></div>
<p class="title">Telephone</p>
<p>0123-4 56 78 90
<p class="title">Telefax</p>
<p>
</div>
HTML

my %hash = map { $_->as_trimmed_text } $p->findnodes('//*[@class="phone"]/p');

use Data::Dump;
dd \%hash;

__END__

{ Telefax => "", Telephone => "0123-4 56 78 90" }
```
Of course, there may still be exceptions that even parsers can't handle, for example something like "<p Text</p>". A browser won't display this correctly either, though, so someone making a typo like this when writing HTML by hand would hopefully notice. And even here, I would suggest first fixing the string with a regex and then sending it through an HTML parser.
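To illustrate that last suggestion, here is a minimal sketch of the "fix with a regex first" step. The helper name fix_unclosed_p and the sample input are made up for this example, and it assumes the damaged <p> tag carries no attributes, as in the question. It is a heuristic for this one kind of typo, not a general repair tool.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): repair an opening <p>
# tag whose closing ">" was lost, e.g. "<p Text</p>" becomes "<p>Text</p>".
# Assumption: the damaged tag has no attributes, so everything between the
# tag name and the next "<" is text content.
sub fix_unclosed_p {
    my ($html) = @_;
    # Match "<p", whitespace, then a run containing no "<", ">" or "=",
    # but only if that run is followed directly by "<". Well-formed tags
    # such as <p class="..."> contain "=" or ">" and are left untouched.
    $html =~ s{<p\s+([^<>=]*)(?=<)}{<p>$1}gi;
    return $html;
}

my $broken = '<div><p Text</p><p class="title">Fax</p></div>';
my $fixed  = fix_unclosed_p($broken);
print "$fixed\n";   # prints <div><p>Text</p><p class="title">Fax</p></div>
```

The repaired string in $fixed can then be handed to the parser with $p->parse($fixed), exactly as in the HTML::TreeBuilder::XPath example, so the regex only patches the one typo and the parser still does the actual parsing.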