PerlMonks  

Re: How to parse not closed HTML tags that don't have any attributes?

by bliako (Monsignor)
on Mar 06, 2021 at 22:13 UTC


in reply to How to parse not closed HTML tags that don't have any attributes?

There is method in madness: today, most html is not handcrafted but machine-made via code. In your case it looks like the generating code itself is broken (and not only the html you have shown, which is broken too). That can work to your advantage, because the output may be broken in a consistent way.

Perhaps a rare case where regex can indeed parse (broken) html?

bw, bliako

Re^2: How to parse not closed HTML tags that don't have any attributes?
by haukex (Archbishop) on Mar 07, 2021 at 06:32 UTC
    Perhaps a rare case where regex can indeed parse (broken) html?

    No.

    Sorry for the direct answer; of course it is IMHO, but in this case my opinion happens to be fairly strong :-)

    today, most html is not handcrafted but machine-made via code.

    ... which means that browsers and other HTML parsers still have to deal with broken, hand-crafted HTML, even today. For example, HTML::Parser is now over 25 years old (it used to be known as HTML::Parse and was part of libwww-perl for a while), so it was written in the days when hand-crafted HTML was the norm, and it'll happily handle the broken HTML as well. Note that it's also the basis of many other modules, like HTML::TreeBuilder and WWW::Mechanize.

    use warnings;
    use strict;
    use HTML::TreeBuilder::XPath;

    my $p = HTML::TreeBuilder::XPath->new;
    # The <p> elements below are never closed; the parser closes them implicitly.
    $p->parse(<<'HTML');
    <div class="phone">
      <div class="icon"></div>
      <p class="title">Telephone</p>
      <p>0123-4 56 78 90
      <p class="title">Telefax</p>
      <p>
    </div>
    HTML

    # The <p> texts alternate title, value, so the flat list builds the hash.
    my %hash = map { $_->as_trimmed_text }
        $p->findnodes('//*[@class="phone"]/p');

    use Data::Dump;
    dd \%hash;

    __END__
    { Telefax => "", Telephone => "0123-4 56 78 90" }

    Of course, there may still be exceptions that even parsers can't handle, for example something like "<p Text</p>". A browser won't display this correctly either, so someone making a typo like this when writing HTML by hand would hopefully notice. Even here, I would suggest first fixing the string with a regex and then sending it through an HTML parser.
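
To illustrate the "fix with a regex first, then parse" suggestion, here is a minimal sketch. The input string and the repair rule are assumptions made up for this example: it presumes the only breakage is a "<p " start tag that is followed by plain text instead of an attribute and is missing its ">". Real-world breakage would need its own targeted pattern.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: a <p> start tag that is missing its closing ">".
my $html = '<div><p Text</p></div>';

# Assumed repair rule: "<p" plus whitespace, followed only by text (no "=",
# so not an attribute) up to the matching "</p>", becomes a proper "<p>".
# Valid tags such as <p class="title"> are left untouched.
(my $fixed = $html) =~ s{<p\s+(?=[^<>=]*</p>)}{<p>}g;

# The regex only repairs the string; extraction is left to the parser.
my $tree  = HTML::TreeBuilder->new_from_content($fixed);
my @texts = map { $_->as_trimmed_text } $tree->look_down(_tag => 'p');
print "$_\n" for @texts;

$tree->delete;
```

The regex is deliberately narrow: it patches one known defect and does nothing else, so everything that is already valid HTML reaches the parser unchanged.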
