Re: How to parse not closed HTML tags that don't have any attributes? (updated)

Replies are listed 'Best First'.
Re^2: How to parse not closed HTML tags that don't have any attributes? (updated) by tobyink (Canon) on Mar 08, 2021 at 13:24 UTC
The HTML looks fine to me. Many HTML elements have optional end tags, and `<p>` is one of them. The following is valid HTML: `<p>I like: <ul> <li>Chocolate <li>Pizza <li>Mexican food </ul> <p>Yet I don't like Mexican chocolate pizza!`
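As a rough illustration of the point about optional end tags, the following sketch feeds that same snippet to Mojo::DOM (the parser used elsewhere in this thread) and checks what it finds. It assumes Mojolicious is installed; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw/trim/;

# The snippet from the post above: no </p> and no </li> anywhere.
my $dom = Mojo::DOM->new(<<'HTML');
<p>I like:
<ul>
  <li>Chocolate
  <li>Pizza
  <li>Mexican food
</ul>
<p>Yet I don't like Mexican chocolate pizza!
HTML

# Each <li> is closed implicitly by the next <li> or by </ul>.
my @items = $dom->find('li')->map(sub { trim($_->text) })->each;

# The opening <ul> implicitly ends the first <p>, so the list is a
# sibling of the paragraph rather than a descendant of it.
my $paragraphs = $dom->find('p')->size;
my $nested     = $dom->find('p ul')->size;

print "items: ", join(', ', @items), "\n";
print "paragraphs: $paragraphs, lists inside a paragraph: $nested\n";
```

If the parser applies the optional-end-tag rules, this should report three list items, two paragraphs, and no list nested inside a paragraph.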
Re^3: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 08, 2021 at 15:30 UTC
"Many HTML elements have optional end tags." Yes, I know - the reason I called it "broken", which admittedly was probably too strong a word (parent updated), is the inconsistent closing tags, which suggest at the very least that not much attention to detail was paid in producing the HTML.
Re^2: How to parse not closed HTML tags that don't have any attributes? (updated) by Rantanplan (Novice) on Mar 07, 2021 at 13:52 UTC
Many thanks again to everyone for all your great help! All your solutions are very tempting. Regexp code in particular always looks like pure magic to me. :-) For the moment I've decided to go with the Mojo::DOM alternative, since I'm still very inexperienced with Perl and I can understand it at least to some extent. So far it gives me really promising results.

There's this wall I ran into, though: `<div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>` In there, the fields "Street name", "Street number", "ZIP code" and "City name" have been carelessly filled into just a single field, separated by the `<br/>` element. With your help I'm now able to access the whole string with `$dom->find('address')`, but no matter what I do, the `<br/>` element in it always gets removed, so it seems to me that I cannot search inside the address string. I thought this might be because Perl treats it as white space, but I wasn't able to find anything useful. Could you please give me a hint?

By the way, thank you for your advice to use Text::CSV. That's a great idea, and I will definitely do that!
Re^3: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 14:03 UTC
"no matter what I do, the `<br/>` element in it always gets removed, so it seems to me that I cannot search inside the address string."

I can't reproduce this (see the output in the `[[ ]]`s below), and you haven't said what your expected output is or what you mean by "search inside the address string" - see "How do I post a question effectively?" and "Short, Self-Contained, Correct Example". As an example, I can replace the `<br/>` like so:

    use warnings;
    use strict;
    use Mojo::DOM;
    use Mojo::Util qw/trim/;

    my $dom = Mojo::DOM->new(<<'HTML');
    <div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>
    HTML

    my $addr = $dom->find('.address address')->first;
    print "[[$addr]]\n";
    $addr->find('br')->map('replace',"\n");
    print "[", trim($addr->text), "]\n";

    __END__
    [[<address> Sample Street 123<br>45678 Randomcity </address>]]
    [Sample Street 123
    45678 Randomcity]

Edit: Forgot to remove the "(updated)" from the node's title before it got a reply. At the time of writing this node and its reply were not actually updated.
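For context on why the `<br/>` can appear to vanish: Mojo::DOM's `text` method returns only an element's text nodes, so a child tag such as `<br>` contributes nothing and the text on either side of it runs together. As an alternative to replacing the tag, here is a minimal sketch that walks the child nodes of `<address>` and keeps the text nodes as separate strings. It assumes Mojolicious is installed; `@lines` is just an illustrative name.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw/trim/;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>
HTML

my $addr = $dom->at('.address address');

# child_nodes includes text nodes as well as tags; keep only the text
# nodes, trim them, and drop any that were whitespace only.
my @lines = $addr->child_nodes
    ->grep(sub { $_->type eq 'text' })
    ->map(sub { trim($_->content) })
    ->grep(sub { length $_ })
    ->each;

print "[$_]\n" for @lines;
```

With this approach the text before and after the `<br/>` ends up in separate list elements, so no placeholder character is needed.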
Re^4: How to parse not closed HTML tags that don't have any attributes? (updated) by Rantanplan (Novice) on Mar 07, 2021 at 16:40 UTC
Many thanks haukex! That it's not reproducible is due to my own terrible incompetence. :-) I had tried to modify your example for the phone/fax section in such a way that it would put these pairs into %hash: `{"Street name" => "Sample Street", "House number" => "123", "ZIP Code" => "45678", "City name" => "Randomcity"}`

With all the things I had tried, I only managed to get the string "Sample Street 12345678 Randomcity" into one of the fields, and the other one was then left empty, like: `{"Sample Street 12345678 Randomcity" => ""}`

I guess my main mistake was to assume that it's necessary to start out from the $dom all over again for each and every HTML element. The crazy idea I had was to somehow grab "Sample Street 123" into one variable (starting from the "address" element), and "45678 Randomcity" into another, by somehow targeting, and starting from, the first `<br/>` element after the "address" element. I'm still not sure why my `<br/>` always got stripped away, maybe because of my misunderstanding of how the "map" works:

    use warnings;
    use strict;
    use Mojo::DOM;
    use Mojo::Util qw/trim/;
    use Data::Dump;

    my $dom = Mojo::DOM->new(<<'HTML');
    <div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>
    HTML

    my %hash_address = @{ $dom->find('address')->map(sub {
        return ( trim($_->text), "This_is_the_address_content" )
    }) };
    dd \%hash_address;

    __END__
    {
      "Sample Street 12345678 Randomcity" => "This_is_the_address_content",
    }

Your solution is very elegant indeed, many thanks! :-)
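To connect this to the hash described above: once the `<br/>` has been turned into a newline, the two lines can be split into the four fields with plain regular expressions. The following is only a sketch under stated assumptions, not something posted in the thread: `parse_address` is a hypothetical helper, the house number is assumed to be the trailing digits (plus an optional letter) of the first line, and the ZIP code is assumed to be exactly five leading digits of the second line. Real address data will have exceptions to both.

```perl
use 5.014;    # for the non-destructive s///r modifier
use strict;
use warnings;

# Hypothetical helper: split "street number\nzip city" into four fields.
# Returns a hash reference, or nothing if the text does not fit the
# assumed layout.
sub parse_address {
    my ($text) = @_;

    # Trim each line and ignore empty ones.
    my ($line1, $line2) =
        grep { length } map { s/^\s+|\s+$//gr } split /\n/, $text;
    return unless defined $line1 && defined $line2;

    # Assumption: the house number is the last token of line one.
    my ($street, $number) = $line1 =~ /^(.+?)\s+(\d+[A-Za-z]?)$/ or return;

    # Assumption: a five-digit ZIP code, followed by the city name.
    my ($zip, $city) = $line2 =~ /^(\d{5})\s+(.+)$/ or return;

    return {
        "Street name"  => $street,
        "House number" => $number,
        "ZIP Code"     => $zip,
        "City name"    => $city,
    };
}

# The trimmed text produced after replacing <br/> with "\n".
my $fields = parse_address("Sample Street 123\n45678 Randomcity");
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

Returning nothing on a failed match lets the caller detect addresses that do not fit the assumed layout instead of silently storing wrong values.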
Re^5: How to parse not closed HTML tags that don't have any attributes? by haukex (Archbishop) on Mar 07, 2021 at 17:47 UTC
Re^6: How to parse not closed HTML tags that don't have any attributes? by Rantanplan (Novice) on Mar 08, 2021 at 14:39 UTC
Some notes below your chosen depth have not been shown here
