
Re^2: How to parse not closed HTML tags that don't have any attributes? (updated)

by Rantanplan (Novice)
on Mar 07, 2021 at 13:52 UTC


in reply to Re: How to parse not closed HTML tags that don't have any attributes? (updated)
in thread How to parse not closed HTML tags that don't have any attributes?

Many thanks again to everyone for all your great help!

All your solutions are very tempting. Especially regexp code always looks like pure magic to me. :-)

For the moment I've decided to go with the Mojo::DOM alternative, since I'm still very inexperienced with Perl, and since it's understandable for me at least to a little extent.

So far it gives me really promising results. There's this wall I ran into, though:

<div class="address">
  <div class="icon"></div>
  <address> Sample Street 123<br/>45678 Randomcity </address>
</div>

In there, the fields "Street name", "Street number", "ZIP code" and "City name" have been carelessly filled into just a single field, separated by the <br/> element.

With your help I'm now able to access the whole string with $dom->find('address'), but no matter what I do, the <br/> element in it always gets removed, so it seems to me that I cannot search inside the address string. I thought this might be because Perl treats it as white space, but I wasn't able to find anything useful.

Could you please give me a hint?

By the way, thank you for your advice to use Text::CSV. That's a great idea, and I will definitely do that!

Replies are listed 'Best First'.
Re^3: How to parse not closed HTML tags that don't have any attributes?
by haukex (Archbishop) on Mar 07, 2021 at 14:03 UTC
    no matter what I do, the <br/> element in it always gets removed, so it seems to me that I cannot search inside the address string.

    I can't reproduce this (see the output in the [[]]s below), and you haven't said what your expected output is or what you mean by "search inside the address string" - see How do I post a question effectively? and Short, Self-Contained, Correct Example. As an example, I can replace the <br/> like so:

    use warnings;
    use strict;
    use Mojo::DOM;
    use Mojo::Util qw/trim/;

    my $dom = Mojo::DOM->new(<<'HTML');
    <div class="address">
      <div class="icon"></div>
      <address> Sample Street 123<br/>45678 Randomcity </address>
    </div>
    HTML

    my $addr = $dom->find('.address address')->first;
    print "[[$addr]]\n";
    $addr->find('br')->map('replace',"\n");
    print "[", trim($addr->text), "]\n";

    __END__

    [[<address> Sample Street 123<br>45678 Randomcity </address>]]
    [Sample Street 123
    45678 Randomcity]

    Edit: Forgot to remove the "(updated)" from the node's title before it got a reply. At the time of writing this node and its reply were not actually updated.

      Many thanks haukex!

      That it's not reproducible is due to my own terrible incompetence. :-) I had tried to modify your example for the phone/fax section in such a way that it would put these pairs into %hash:

      {"Street name" => "Sample Street", "House number" => "123", "ZIP Code" => "45678", "City name" => "Randomcity"}

      With all the things I had tried, I only managed to get the string "Sample Street 12345678 Randomcity" into one of the fields, and the other one then was left empty, like:

      {"Sample Street 12345678 Randomcity" => ""}

      I guess my main mistake was to assume, that it's necessary to start out from the $dom all over again, for each and every HTML element. The crazy idea I had was to somehow grab "Sample Street 123" into one variable (starting from the "address" element), and "45678 Randomcity" into another, by somehow targeting, and starting from, the first <br/> element after the "address" element.

      I'm still not sure why my <br/> always got stripped away, maybe because of my misunderstanding of how the "map" works:

      use warnings;
      use strict;
      use Mojo::DOM;
      use Mojo::Util qw/trim/;
      use Data::Dump;

      my $dom = Mojo::DOM->new(<<'HTML');
      <div class="address">
        <div class="icon"></div>
        <address> Sample Street 123<br/>45678 Randomcity </address>
      </div>
      HTML

      my %hash_address = @{ $dom->find('address')->map(sub {
          return ( trim($_->text), "This_is_the_address_content" )
      }) };
      dd \%hash_address;

      __END__

      {
        "Sample Street 12345678 Randomcity" => "This_is_the_address_content",
      }

      Your solution is very elegant indeed, many thanks! :-)

        I'm still not sure why my <br/> always got stripped away,

        In the code you show above, ->find('address') is finding the <address> element, and then inside the ->map(sub { ... }), $_ is referring to that element, of which $_->text is getting only the text content, hence the missing <br/>. In the code I showed two nodes above, first I'm getting the <address> element into $addr, which preserves the document's structure, replacing the <br/>, and only then using ->text to get the text content.
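        To make the difference concrete, here is a minimal sketch comparing what ->text, ->content and stringification return for the same element. The markup is the sample from this thread reduced to a single element, and the variable names are only for illustration:

```perl
use warnings;
use strict;
use Mojo::DOM;

# Sample markup from the thread, reduced to the <address> element
my $addr = Mojo::DOM->new(
    '<address>Sample Street 123<br/>45678 Randomcity</address>'
)->at('address');

my $text_only = $addr->text;       # text of the direct child text nodes only
my $inner     = $addr->content;    # inner markup, child elements included
my $outer     = $addr->to_string;  # the element itself, as markup

print "text:    [$text_only]\n";
print "content: [$inner]\n";
print "string:  [$outer]\n";
```

        Only the first of the three loses the <br/>, which is why any replacing or splitting has to happen on the element before ->text is called.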

        I guess my main mistake was to assume, that it's necessary to start out from the $dom all over again, for each and every HTML element.

        ->find will use whatever node you call it on as the context, so it depends on what part of the document you want to search and where in the document the nodes you're looking for can occur.
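        As a small illustration of that context behaviour, the sketch below runs the same selector once on the whole document and once on a sub-node. The two-paragraph markup is made up for this example and is not taken from your files:

```perl
use warnings;
use strict;
use Mojo::DOM;

# Hypothetical markup: one <p> inside .address, one outside of it
my $dom = Mojo::DOM->new(<<'HTML');
<div class="address"><p>inside</p></div>
<p>outside</p>
HTML

# Same selector, two different starting points
my $all_p     = $dom->find('p');                    # searches the whole document
my $address_p = $dom->at('.address')->find('p');    # searches below .address only

print "whole document:  ", $all_p->size, "\n";
print "inside .address: ", $address_p->size, "\n";
```

        So there is no need to go back to $dom each time; once you hold the node you care about, you can keep searching from there.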

        The crazy idea I had was to somehow grab "Sample Street 123" into one variable (starting from the "address" element), and "45678 Randomcity" into another, by somehow targeting, and starting from, the first <br/> element after the "address" element.

        It's possible, sure - in the Document Object Model, the <address> element has three children: a text node "Sample Street 123", the <br/> element, and another text node "45678 Randomcity" - you'll see this if you try looking at $addr->child_nodes.
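        A rough sketch of that idea, assuming the address always consists of text nodes separated by <br/> elements (an assumption that would need checking against your real files). It walks the child nodes and starts a new line at every <br/>; the names @lines, $street and $city are just illustrative:

```perl
use warnings;
use strict;
use Mojo::DOM;
use Mojo::Util qw/trim/;

my $dom = Mojo::DOM->new(
    '<address> Sample Street 123<br/>45678 Randomcity </address>');
my $addr = $dom->at('address');

# child_nodes returns text nodes as well as elements, in document order
my @lines = ('');
for my $node ( $addr->child_nodes->each ) {
    if ( $node->type eq 'text' ) {
        $lines[-1] .= $node->content;    # append text to the current line
    }
    elsif ( $node->type eq 'tag' && $node->tag eq 'br' ) {
        push @lines, '';                 # a <br/> starts the next line
    }
}

my ( $street, $city ) = map { trim($_) } @lines;
print "[$street] [$city]\n";
```

        Splitting the two strings further into street name, house number, ZIP code and city would need extra rules about how those values are written, so I've left that out here.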

        But I think this goes back to what I was saying about example code being brittle if written based on too few examples, and writing lots of test cases: so far, you've only shown two snippets of data out of what you said are 10,000 *.html files. So for example, marto's code makes the assumption that the phone and fax will always be the 2nd and 4th <p>s, respectively, my code here makes the assumption that it's always the next node after the <p class="title"> that will contain the data (and that there are no duplicate keys in the hash, and one or two other assumptions), my code here assumes that any element of class="address" contains only one <address> element that we're interested in, my code here assumes that the <p>s in elements of class="phone" are always in key+value pairs, and so on.

        My suggestions would be for you to first survey your input files, and see how much variation there is, so that you can boil it down to a representative set of test cases, and to code defensively, i.e. testing all of the assumptions I named above. Here's what that could look like:

        use warnings;
        use strict;
        use Mojo::DOM;
        use Mojo::Util qw/trim/;

        # this sub should really be in its own package for modularity
        sub get_data {
            my $html = shift;
            my %data;
            my $dom = Mojo::DOM->new($html);
            my $addr = $dom->find('.address address');
            # could add some conditionals here
            # in case there are separate fields for street / city / zip etc.
            die "Didn't find exactly one address" unless @$addr==1;
            $addr = $addr->first;
            $addr->find('br')->map('replace',"\n");
            $data{address} = { Address => trim( $addr->text ) };
            my $phone = $dom->find('.phone p');
            die "Didn't find an even number of elements in phone" if @$phone%2;
            while (@$phone) {
                my $key = trim( shift(@$phone)->text );
                die "Duplicate key '$key' in phone data"
                    if exists $data{phone}{$key};
                $data{phone}{$key} = trim( shift(@$phone)->text );
            }
            return \%data;
        }

        use Test::More;

        is_deeply get_data(<<'HTML'),
        <div class="address">
          <div class="icon"></div>
          <address> Sample Street 123<br/>45678 Randomcity </address>
        </div>
        <div class="phone">
          <div class="icon"></div>
          <p class="title">Telephone</p>
          <p>0123-4 56 78 90
          <p class="title">Telefax</p>
          <p>
        </div>
        HTML
            {
                address => { Address => "Sample Street 123\n45678 Randomcity" },
                phone => { Telephone => "0123-4 56 78 90", Telefax => "" },
            };

        # TODO: many more test cases here

        done_testing;
