How to parse not closed HTML tags that don't have any attributes?

Rantanplan has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to parse not closed HTML tags that don't have any attributes? by marto (Cardinal) on Mar 06, 2021 at 20:01 UTC
Consider this Mojo::DOM example (`dragnet.pl`). I've made some assumptions, as your source data does not seem complete:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use feature 'say';

my $html = '<div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90</p>
    <p class="title">Telefax</p>
    <p>just the fax ma\'am</p>
</div>';

my $dom = Mojo::DOM->new( $html );

# The second and fourth <p> children of div.phone hold the values
my $phone = $dom->at('div.phone > p:nth-of-type(2)')->text;
say $phone;

my $fax = $dom->at('div.phone > p:nth-of-type(4)')->text;
say $fax;
```

Prints:

```
0123-4 56 78 90
just the fax ma'am
```

Let us know if you have any problems or your input data is somehow weirder.

Update: Sorry, late in the day on a Saturday here. Since the HTML isn't valid, and I'm guessing you can't change that, try:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(trim);
use feature 'say';

my $html = '<div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90
    <p class="title">Telefax</p>
    <p>just the fax ma\'am
</div>';

my $dom = Mojo::DOM->new( $html );

# trim() removes the whitespace left over from the unclosed <p> elements
my $phone = trim( $dom->at('div.phone > p:nth-of-type(2)')->text );
say $phone;

my $fax = trim( $dom->at('div.phone > p:nth-of-type(4)')->text );
say $fax;
```

Which still prints:

```
0123-4 56 78 90
just the fax ma'am
```
Re: How to parse not closed HTML tags that don't have any attributes? (updated) by haukex (Archbishop) on Mar 06, 2021 at 20:07 UTC
The HTML is indeed ~~broken~~ inconsistent, and you've only shown one sample, so any example code will be correspondingly brittle. Like marto, I would suggest Mojo::DOM, as it has an IMHO nice interface, and it is still able to parse that HTML.

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;
use Data::Dump;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90
    <p class="title">Telefax</p>
    <p>
</div>
HTML

# Pair each title with the text of the element that follows it
my %hash = @{ $dom->find('p.title')->map(sub {
    return ( trim($_->text), trim($_->next->text) ) }) };
dd \%hash;

__END__

{ Telefax => "", Telephone => "0123-4 56 78 90" }
```

Update: Assuming you've got a lot of other `<div>`s in your HTML, you may want to change the expression in `->find()` to `'.phone p.title'`.
Re^2: How to parse not closed HTML tags that don't have any attributes? (updated) by tobyink (Canon) on Mar 08, 2021 at 13:24 UTC
The HTML looks fine to me. Many HTML elements have optional end tags. `<p>` is one of them. The following is valid HTML:

```
<p>I like:
<ul>
  <li>Chocolate
  <li>Pizza
  <li>Mexican food
</ul>
<p>Yet I don't like Mexican chocolate pizza!
```
Re^3: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 08, 2021 at 15:30 UTC
> Many HTML elements have optional end tags.

Yes, I know - the reason I called it "broken", which admittedly was probably too strong a word (parent updated), is because of the inconsistent closing tags, suggesting at the very least that not very much attention to detail was paid in the production of the HTML.
Re^2: How to parse not closed HTML tags that don't have any attributes? (updated) by Rantanplan (Novice) on Mar 07, 2021 at 13:52 UTC
Many thanks again to everyone for all your great help! All your solutions are very tempting. Especially regexp code always looks like pure magic to me. :-) For the moment I've decided to go with the Mojo::DOM alternative, since I'm still very inexperienced with Perl, and since it's understandable for me at least to a little extent. So far it gives me really promising results. There's this wall I ran into, though:

```
<div class="address">
    <div class="icon"></div>
    <address>
        Sample Street 123<br/>45678 Randomcity
    </address>
</div>
```

In there, the fields "Street name", "Street number", "ZIP code" and "City name" have been carelessly filled into just a single field, separated by the `<br/>` element. With your help I'm now able to access the whole string with `$dom->find('address')`, but no matter what I do, the `<br/>` element in it always gets removed, so it seems to me that I cannot search inside the address string. I thought this might be because Perl treats it as white space, but I wasn't able to find anything useful. Could you please give me a hint?

By the way, thank you for your advice to use Text::CSV. That's a great idea, and I will definitely do that!
Re^3: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 14:03 UTC
> no matter what I do, the `<br/>` element in it always gets removed, so it seems to me that I cannot search inside the address string.

I can't reproduce this (see the output in the `[[]]`s below), and you haven't said what your expected output is or what you mean by "search inside the address string" - see "How do I post a question effectively?" and "Short, Self-Contained, Correct Example". As an example, I can replace the `<br/>` like so:

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="address">
    <div class="icon"></div>
    <address>
        Sample Street 123<br/>45678 Randomcity
    </address>
</div>
HTML

my $addr = $dom->find('.address address')->first;
print "[[$addr]]\n";

# Replace every <br> inside the address with a newline
$addr->find('br')->map('replace',"\n");
print "[", trim($addr->text), "]\n";

__END__

[[<address>
        Sample Street 123<br>45678 Randomcity
    </address>]]
[Sample Street 123
45678 Randomcity]
```

Edit: Forgot to remove the "(updated)" from the node's title before it got a reply. At the time of writing this node and its reply were not actually updated.
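Once the `<br/>` has become a newline, the two-line text can be split into the four fields the original poster listed. The following is only a minimal sketch using core Perl: `split_address` is a hypothetical helper that does not appear in the thread, and it assumes the first line is always "street, then house number" and the second is always "ZIP code, then city". Real data across 10,000 files will probably need more cases.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split the two-line address text
# into street, house number, ZIP code and city.
# Assumes line 1 is "<street> <number>" and line 2 is "<ZIP> <city>".
sub split_address {
    my ($text) = @_;
    my ($line1, $line2) = split /\n/, $text, 2;
    return unless defined $line1 && defined $line2;

    # Street is everything before the last token; that token must start
    # with a digit (e.g. "123" or "12a").
    my ($street, $number) = $line1 =~ /^\s*(.+?)\s+(\d\S*)\s*$/
        or return;

    # ZIP is the leading run of digits, city is the rest of the line.
    my ($zip, $city) = $line2 =~ /^\s*(\d+)\s+(.+?)\s*$/
        or return;

    return { street => $street, number => $number, zip => $zip, city => $city };
}

my $addr = split_address("Sample Street 123\n45678 Randomcity");
print join('|', @{$addr}{qw(street number zip city)}), "\n" if $addr;
# prints: Sample Street|123|45678|Randomcity
```

The function returns nothing when either line does not fit the assumed pattern, so unexpected addresses can be logged and inspected instead of silently producing wrong columns.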
Re^4: How to parse not closed HTML tags that don't have any attributes? (updated) by Rantanplan (Novice) on Mar 07, 2021 at 16:40 UTC
Re^5: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 17:47 UTC
Some notes below your chosen depth have not been shown here
Re: How to parse not closed HTML tags that don't have any attributes? by Rantanplan (Novice) on Mar 06, 2021 at 22:25 UTC
Hi, many thanks to all of you for your fast replies and for all this great advice! My HTML is indeed very "lazy", meaning that a lot of things that aren't 100 % necessary seem to have been omitted. There are about 10,000 *.html files containing company info, each with a few sections like the one shown, plus various other stuff in them as well. I'm going to be happy if at the end, I can manage to extract (company name + telephone number + fax number + street number + street + ZIP code + city name) into a CSV file, everything separated by commas. Now I'm going to test a bit, helped by the wonderful input from your side, many thanks!!
Re^2: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 06:34 UTC
> There are about 10,000 *.html files

As I hinted, make sure to get a representative sampling of all of this input for your test cases.

> into a CSV file

Definitely use Text::CSV (also install Text::CSV_XS for speed).
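To illustrate why Text::CSV is preferable to joining fields with commas by hand, here is a small sketch. It assumes Text::CSV is installed; the `%company` record, its field names and the `csv_line` helper are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 allows non-ASCII data; auto_diag => 1 reports errors.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# Hypothetical record; the field names and values are illustrative only.
my %company = (
    name   => 'Example, Inc.',
    phone  => '0123-4 56 78 90',
    fax    => '',
    number => '123',
    street => 'Sample Street',
    zip    => '45678',
    city   => 'Randomcity',
);
my @columns = qw(name phone fax number street zip city);

# Hypothetical helper: turn a list of fields into one CSV line (no newline).
sub csv_line {
    my ($csv, @fields) = @_;
    $csv->combine(@fields)
        or die "cannot combine fields: " . $csv->error_diag;
    return $csv->string;
}

print csv_line($csv, @columns), "\n";             # header row
print csv_line($csv, @company{@columns}), "\n";   # one data row
```

The company name deliberately contains a comma: the module quotes it, whereas a plain `join(',', ...)` would silently produce an extra column.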
Re^2: How to parse not closed HTML tags that don't have any attributes? by Bod (Parson) on Mar 06, 2021 at 22:36 UTC
> Now I'm going to test a bit

Please do let us know how you get on.
Re: How to parse not closed HTML tags that don't have any attributes? by tybalt89 (Monsignor) on Mar 06, 2021 at 23:24 UTC
```perl
#!/usr/bin/perl

use strict;
use warnings;

# Slurp the whole input into $_
local $_ = do { local $/; <DATA> };

# Match each title paragraph and the text of the paragraph that follows it
while( /<p class="title">(\w+)<\/p>\s*<p>([^<>]*)/g )
  {
  my $title = $1;
  printf "%20s %s", $title, $2 =~ s/\s*\z/\n/r;
  }

__DATA__
<div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90
    <p class="title">Telefax</p>
    <p>
</div>
```

Outputs:

```
           Telephone 0123-4 56 78 90
             Telefax
```

Well, it works for all the provided test cases :)
Re: How to parse not closed HTML tags that don't have any attributes? by jcb (Parson) on Mar 06, 2021 at 21:41 UTC
I suggest `HTML::Parser` and a state machine. You want three states:

1. idle
2. find telephone number item
3. extract telephone number

Start in idle state and transition to "find telephone number item" when you get a `start` event for a `P` tag with `class="title"`, then transition from that to "extract telephone number" when you get a `text` event containing "Telephone", otherwise return to idle state at the next `text` event. In "extract telephone number" state, store away the phone number at the first `text` event matching `m/[[:digit:]]/` and return to idle state. If you only have one telephone number per page, you can also abort the parse at that point.

See the documentation for HTML::Parser for more details about that module and any good computer science text for more details about using finite state machines as parsers.
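The three states described above could be sketched roughly as follows. This is one possible reading of the description, not code from the thread: the function name `extract_phone` and the state names are invented here, HTML::Parser must be installed, and the early abort mentioned above is left out for brevity.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the first telephone number found, or undef.
sub extract_phone {
    my ($html) = @_;
    my $state = 'idle';    # idle -> find_item -> extract -> idle
    my $phone;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            # <p class="title"> starts the search for a "Telephone" label
            $state = 'find_item'
                if $state eq 'idle'
                && $tag eq 'p'
                && ($attr->{class} // '') eq 'title';
        }, 'tagname, attr' ],
        text_h => [ sub {
            my ($text) = @_;
            if ($state eq 'find_item') {
                # Only the "Telephone" title leads on to extraction
                $state = $text =~ /Telephone/ ? 'extract' : 'idle';
            }
            elsif ($state eq 'extract' && $text =~ /[[:digit:]]/) {
                # First text containing a digit is taken as the number
                ($phone = $text) =~ s/^\s+|\s+$//g;
                $state = 'idle';
            }
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);    # deliver each text run as a single event
    $parser->parse($html);
    $parser->eof;
    return $phone;
}

my $sample = <<'HTML';
<div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90
    <p class="title">Telefax</p>
    <p>
</div>
HTML

print extract_phone($sample), "\n";    # prints: 0123-4 56 78 90
```

Because the handlers only react to `start` and `text` events, the missing `</p>` end tags in the sample do not matter to this approach.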
Re: How to parse not closed HTML tags that don't have any attributes? by bliako (Monsignor) on Mar 06, 2021 at 22:13 UTC
There is method in madness: today, most html is not handcrafted but machine-made via code. In your case the code looks broken (as opposed to the html you have shown being broken - which it is). That can be to your advantage as it may be broken in a consistent way. Perhaps a rare case where regex can indeed parse (broken) html? bw, bliako
Re^2: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 06:32 UTC
> Perhaps a rare case where regex can indeed parse (broken) html?

No. Sorry for the direct answer, of course it is IMHO, but in this case my opinion happens to be fairly strong `:-)`

> today, most html is not handcrafted but machine-made via code.

... which means that browsers and other HTML parsers still have to deal with broken, hand-crafted HTML, even today. For example, HTML::Parser is now over 25 years old (it used to be known as `HTML::Parse` and was part of libwww-perl for a while), so it was written during the days where hand-crafted HTML was the norm, and it'll happily handle the broken HTML as well. Note that it's also the basis used in many other modules, like HTML::TreeBuilder and WWW::Mechanize.

```perl
use warnings;
use strict;
use HTML::TreeBuilder::XPath;

my $p = HTML::TreeBuilder::XPath->new;
$p->parse(<<'HTML');
<div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90
    <p class="title">Telefax</p>
    <p>
</div>
HTML

# The <p> children alternate title, value, title, value, ...
my %hash = map { $_->as_trimmed_text }
    $p->findnodes('//*[@class="phone"]/p');
use Data::Dump;
dd \%hash;

__END__

{ Telefax => "", Telephone => "0123-4 56 78 90" }
```

Of course, there may still be exceptions that even parsers can't handle. For example, say something like "`<p Text</p>`" - though of course a browser won't display this correctly, so even someone making a typo like this when writing HTML by hand would hopefully notice, plus, even here I would suggest first fixing the string with a regex and then sending it through an HTML parser.
Re: How to parse not closed HTML tags that don't have any attributes? by Anonymous Monk on Mar 06, 2021 at 20:31 UTC
Re: htmltreexpather.pl - xpath helper, creates xpath search strings from html ($VERSION = 20120112 )
