Re: How to parse not closed HTML tags that don't have any attributes? (updated)

by haukex (Bishop)
on Mar 06, 2021 at 20:07 UTC


in reply to How to parse not closed HTML tags that don't have any attributes?

The HTML is indeed inconsistent, and you've only shown one sample, so any example code will be correspondingly brittle. Like marto, I would suggest Mojo::DOM, as it has an IMHO nice interface, and it is still able to parse that HTML.

    use warnings;
    use strict;
    use Mojo::DOM;
    use Mojo::Util qw/trim/;
    use Data::Dump;

    my $dom = Mojo::DOM->new(<<'HTML');
    <div class="phone">
    <div class="icon"></div>
    <p class="title">Telephone</p>
    <p>0123-4 56 78 90
    <p class="title">Telefax</p>
    <p>
    </div>
    HTML

    my %hash = @{ $dom->find('p.title')->map(sub {
        return ( trim($_->text), trim($_->next->text) ) }) };
    dd \%hash;

    __END__

    { Telefax => "", Telephone => "0123-4 56 78 90" }

Update: Assuming you've got a lot of other <div>s in your HTML, you may want to change the expression in ->find() to '.phone p.title'.
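A minimal sketch of what the narrower selector changes. The second block, `<div class="hours">`, is made up for illustration and is not from the original page; the sketch only assumes Mojo::DOM's usual `find`, `map` and `each` methods.

```perl
use warnings;
use strict;
use Mojo::DOM;

# The "hours" block is hypothetical; only the "phone" block is from the thread.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="phone">
<p class="title">Telephone</p>
<p>0123-4 56 78 90</p>
</div>
<div class="hours">
<p class="title">Opening hours</p>
<p>9-17</p>
</div>
HTML

# Unscoped: matches every p.title anywhere in the document.
my @all = $dom->find('p.title')->map('text')->each;

# Scoped: only p.title elements that are descendants of .phone.
my @phone = $dom->find('.phone p.title')->map('text')->each;

print "all:   @all\n";
print "phone: @phone\n";
```

With the unscoped selector the title from the unrelated block ends up in the result as well, which would put a stray key into %hash in the code above.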

Re^2: How to parse not closed HTML tags that don't have any attributes? (updated)
by tobyink (Canon) on Mar 08, 2021 at 13:24 UTC

    The HTML looks fine to me. Many HTML elements have optional end tags. <p> is one of them. The following is valid HTML:

        <p>I like:
        <ul>
          <li>Chocolate
          <li>Pizza
          <li>Mexican food
        </ul>
        <p>Yet I don't like Mexican chocolate pizza!
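One way to check how a parser copes with the omitted end tags is to feed it that snippet and count the list items. This is only a sketch using Mojo::DOM (the parser suggested in the parent node); it says nothing about other parsers, and the `ul > li` selector is just a convenient way to see whether the items were parsed as siblings rather than nested inside each other.

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;

my $dom = Mojo::DOM->new(<<'HTML');
<p>I like:
<ul>
<li>Chocolate
<li>Pizza
<li>Mexican food
</ul>
<p>Yet I don't like Mexican chocolate pizza!
HTML

# Only <li> elements that are direct children of the <ul> are matched,
# so nested (wrongly parsed) items would not be counted here.
my @items = $dom->find('ul > li')->map(sub { trim($_->text) })->each;

print scalar(@items), " items: ", join(', ', @items), "\n";
```

If the optional end tags are handled as the HTML rules describe, each `<li>` is closed by the next `<li>` or by `</ul>`, and all three items show up as direct children of the list.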
      "Many HTML elements have optional end tags."

      Yes, I know - the reason I called it "broken", which admittedly was probably too strong a word (parent updated), is because of the inconsistent closing tags, suggesting at the very least that not very much attention to detail was paid in the production of the HTML.

Re^2: How to parse not closed HTML tags that don't have any attributes? (updated)
by Rantanplan (Novice) on Mar 07, 2021 at 13:52 UTC

    Many thanks again to everyone for all your great help!

    All your solutions are very tempting. Especially regexp code always looks like pure magic to me. :-)

    For the moment I've decided to go with the Mojo::DOM alternative, since I'm still very inexperienced with Perl, and since it's understandable for me at least to a little extent.

    So far it gives me really promising results. There's this wall I ran into, though:

        <div class="address">
        <div class="icon"></div>
        <address> Sample Street 123<br/>45678 Randomcity </address>
        </div>

    In there, the fields "Street name", "Street number", "ZIP code" and "City name" have been carelessly filled into just a single field, separated by the <br/> element.

    With your help I'm now able to access the whole string with $dom->find('address'), but no matter what I do, the <br/> element in it always gets removed, so it seems to me that I cannot search inside the address string. I thought this might be because Perl treats it as white space, but I wasn't able to find anything useful.

    Could you please give me a hint?

    By the way, thank you for your advice to use Text::CSV. That's a great idea, and I will definitely do that!
      "no matter what I do, the <br/> element in it always gets removed, so it seems to me that I cannot search inside the address string."

      I can't reproduce this (see the output in the [[]]s below), and you haven't said what your expected output is or what you mean by "search inside the address string" - see How do I post a question effectively? and Short, Self-Contained, Correct Example. As an example, I can replace the <br/> like so:

          use warnings;
          use strict;
          use Mojo::DOM;
          use Mojo::Util qw/trim/;

          my $dom = Mojo::DOM->new(<<'HTML');
          <div class="address">
          <div class="icon"></div>
          <address> Sample Street 123<br/>45678 Randomcity </address>
          </div>
          HTML

          my $addr = $dom->find('.address address')->first;
          print "[[$addr]]\n";
          $addr->find('br')->map('replace',"\n");
          print "[", trim($addr->text), "]\n";

          __END__

          [[<address> Sample Street 123<br>45678 Randomcity </address>]]
          [Sample Street 123
          45678 Randomcity]

      Edit: Forgot to remove the "(updated)" from the node's title before it got a reply. At the time of writing this node and its reply were not actually updated.

        Many thanks haukex!

        That it's not reproducible is due to my own terrible incompetence. :-) I had tried to modify your example for the phone/fax section in such a way that it would put these pairs into %hash:

        {"Street name" => "Sample Street", "House number" => "123", "ZIP Code" => "45678", "City name" => "Randomcity"}

        With all the things I had tried, I only managed to get the string "Sample Street 12345678 Randomcity" into one of the fields, and the other one then was left empty, like:

        {"Sample Street 12345678 Randomcity" => ""}

        I guess my main mistake was to assume that it's necessary to start out from the $dom all over again, for each and every HTML element. The crazy idea I had was to somehow grab "Sample Street 123" into one variable (starting from the "address" element), and "45678 Randomcity" into another, by somehow targeting, and starting from, the first <br/> element after the "address" element.

        I'm still not sure why my <br/> always got stripped away, maybe because of my misunderstanding of how the "map" works:

            use warnings;
            use strict;
            use Mojo::DOM;
            use Mojo::Util qw/trim/;
            use Data::Dump;

            my $dom = Mojo::DOM->new(<<'HTML');
            <div class="address">
            <div class="icon"></div>
            <address> Sample Street 123<br/>45678 Randomcity </address>
            </div>
            HTML

            my %hash_address = @{ $dom->find('address')->map(sub {
                return ( trim($_->text), "This_is_the_address_content" ) }) };
            dd \%hash_address;

            __END__

            {
              "Sample Street 12345678 Randomcity" => "This_is_the_address_content",
            }

        Your solution is very elegant indeed, many thanks! :-)
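On the vanishing `<br/>`: Mojo::DOM's `text` method returns only the text nodes directly inside the element, and a `<br>` is a child element with no text of its own, so the two lines end up glued together. A possible sketch for the four-field hash described above walks the child nodes instead and treats the text before and after the `<br>` as separate lines. The splitting rules are assumptions, not something the thread establishes: the house number is taken to be the last whitespace-separated token of the first line, and the ZIP code the first token of the second. Real addresses will break this easily.

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="address">
<div class="icon"></div>
<address> Sample Street 123<br/>45678 Randomcity </address>
</div>
HTML

my $addr = $dom->find('.address address')->first;

# Keep only the text nodes (skipping the <br> element), trim them,
# and drop any that are empty after trimming.
my ($line1, $line2) = $addr->child_nodes
    ->grep(sub { $_->type eq 'text' })
    ->map(sub { trim($_->content) })
    ->grep(sub { length })
    ->each;

my %hash;

# Assumption: "<street name> <house number>" on the first line.
if (defined $line1 && $line1 =~ /^(.+)\s+(\S+)$/) {
    @hash{'Street name', 'House number'} = ($1, $2);
}

# Assumption: "<ZIP code> <city name>" on the second line.
if (defined $line2 && $line2 =~ /^(\S+)\s+(.+)$/) {
    @hash{'ZIP Code', 'City name'} = ($1, $2);
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Nothing here starts over from $dom for each field: once $addr is in hand, everything else is reached from that one element.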
