
Rantanplan has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'd like to grab phone number and fax number strings from this HTML section:
    <div class="phone">
        <div class="icon"></div>
        <p class="title">Telephone</p>
        <p>0123-4 56 78 90
        <p class="title">Telefax</p>
        <p>
    </div>

Unfortunately, there may, or may not be entries in the fields for the phone and the fax numbers.

I have tried HTML::TreeBuilder with find_by_attribute and look_down, but can't figure out how to do it.

Could someone help me please? Many thanks!

Replies are listed 'Best First'.
Re: How to parse not closed HTML tags that don't have any attributes?
by marto (Cardinal) on Mar 06, 2021 at 20:01 UTC

    Consider this Mojo::DOM example, I've made some assumptions as your source data does not seem complete:

    cat dragnet.pl

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Mojo::DOM;
        use feature 'say';

        my $html = '<div class="phone"> <div class="icon"></div> <p class="title">Telephone</p> <p>0123-4 56 78 90</p> <p class="title">Telefax</p> <p>just the fax ma\'am</p> </div>';

        my $dom = Mojo::DOM->new( $html );

        my $phone = $dom->at('div.phone > p:nth-of-type(2)')->text;
        say $phone;

        my $fax = $dom->at('div.phone > p:nth-of-type(4)')->text;
        say $fax;

    Prints:

        0123-4 56 78 90
        just the fax ma'am

    Let us know if you have any problems or your input data is somehow weirder.

    Update: Sorry, late in the day on a Saturday here. Since the HTML isn't valid, and I'm guessing you can't change that, try:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Mojo::DOM;
        use Mojo::Util qw(trim);
        use feature 'say';

        my $html = '<div class="phone"> <div class="icon"></div> <p class="title">Telephone</p> <p>0123-4 56 78 90 <p class="title">Telefax</p> <p>just the fax ma\'am </div>';

        my $dom = Mojo::DOM->new( $html );

        my $phone = trim( $dom->at('div.phone > p:nth-of-type(2)')->text );
        say $phone;

        my $fax = trim( $dom->at('div.phone > p:nth-of-type(4)')->text );
        say $fax;

    Which still prints:

        0123-4 56 78 90
        just the fax ma'am
Re: How to parse not closed HTML tags that don't have any attributes? (updated)
by haukex (Archbishop) on Mar 06, 2021 at 20:07 UTC

    The HTML is indeed inconsistent, and you've only shown one sample, so any example code will be correspondingly brittle. Like marto, I would suggest Mojo::DOM, as it has an IMHO nice interface, and it is still able to parse that HTML.

        use warnings;
        use strict;
        use Mojo::DOM;
        use Mojo::Util qw/trim/;
        use Data::Dump;

        my $dom = Mojo::DOM->new(<<'HTML');
        <div class="phone">
            <div class="icon"></div>
            <p class="title">Telephone</p>
            <p>0123-4 56 78 90
            <p class="title">Telefax</p>
            <p>
        </div>
        HTML
        my %hash = @{ $dom->find('p.title')->map(sub {
            return ( trim($_->text), trim($_->next->text) ) }) };
        dd \%hash;
        __END__
        { Telefax => "", Telephone => "0123-4 56 78 90" }

    Update: Assuming you've got a lot of other <div>s in your HTML, you may want to change the expression in ->find() to '.phone p.title'.
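To illustrate the scoping suggestion above, here is a minimal sketch. It assumes a hypothetical page in which an unrelated <div class="other"> also contains a p.title (that block and its "Opening hours" text are invented for the example). The loop also guards against a title with no following sibling, which the samples in this thread do not show but which may occur in real input.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(trim);

# Hypothetical page: an unrelated <div> that also contains a p.title,
# followed by the phone block from the question.
my $html = join '',
    '<div class="other"> <p class="title">Opening hours</p> <p>9-17 </div>',
    '<div class="phone"> <div class="icon"></div>',
    ' <p class="title">Telephone</p> <p>0123-4 56 78 90',
    ' <p class="title">Telefax</p> <p> </div>';

my $dom = Mojo::DOM->new($html);

my %hash;
for my $title ( $dom->find('.phone p.title')->each ) {
    my $value = $title->next;    # next sibling element, undef if there is none
    $hash{ trim( $title->text ) } = defined $value ? trim( $value->text ) : '';
}

# Only the titles inside .phone end up in the hash.
printf "%s => '%s'\n", $_, $hash{$_} for sort keys %hash;
```

With a plain 'p.title' selector the "Opening hours" entry would be picked up as well; the '.phone' prefix limits the search to descendants of the phone block.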

      The HTML looks fine to me. Many HTML elements have optional end tags. <p> is one of them. The following is valid HTML:

          <p>I like:
          <ul>
            <li>Chocolate
            <li>Pizza
            <li>Mexican food
          </ul>
          <p>Yet I don't like Mexican chocolate pizza!
        Many HTML elements have optional end tags.

        Yes, I know - the reason I called it "broken", which admittedly was probably too strong a word (parent updated), is because of the inconsistent closing tags, suggesting at the very least that not very much attention to detail was paid in the production of the HTML.

      Many thanks again to everyone for all your great help!

      All your solutions are very tempting. Especially regexp code always looks like pure magic to me. :-)

      For the moment I've decided to go with the Mojo::DOM alternative, since I'm still very inexperienced with Perl, and since it's understandable for me at least to a little extent.

      So far it gives me really promising results. There's this wall I ran into, though:

      <div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>

      In there, the fields "Street name", "Street number", "ZIP code" and "City name" have been carelessly filled into just a single field, separated by the <br/> element.

      With your help I'm now able to access the whole string with $dom->find('address'), but no matter what I do, the <br/> element in it always gets removed, so it seems to me that I cannot search inside the address string. I thought this might be because Perl treats it as white space, but I wasn't able to find anything useful.

      Could you please give me a hint?

      By the way, thank you for your advice to use Text::CSV. That's a great idea, and I will definitely do that!
        no matter what I do, the <br/> element in it always gets removed, so it seems to me that I cannot search inside the address string.

        I can't reproduce this (see the output in the [[]]s below), and you haven't said what your expected output is or what you mean by "search inside the address string" - see How do I post a question effectively? and Short, Self-Contained, Correct Example. As an example, I can replace the <br/> like so:

            use warnings;
            use strict;
            use Mojo::DOM;
            use Mojo::Util qw/trim/;

            my $dom = Mojo::DOM->new(<<'HTML');
            <div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>
            HTML
            my $addr = $dom->find('.address address')->first;
            print "[[$addr]]\n";
            $addr->find('br')->map('replace',"\n");
            print "[", trim($addr->text), "]\n";
            __END__
            [[<address> Sample Street 123<br>45678 Randomcity </address>]]
            [Sample Street 123
            45678 Randomcity]
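If the goal is to split the address into street, house number, ZIP code and city, one possible approach is to read the text nodes on either side of the <br> directly and then apply two small patterns. This is only a sketch: the patterns assume the layout of the single sample shown ("street number" on the first line, a five-digit ZIP followed by the city on the second), and real data will likely need more cases.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(trim);

my $dom = Mojo::DOM->new(
    '<div class="address"> <div class="icon"></div> <address> Sample Street 123<br/>45678 Randomcity </address> </div>'
);

my $addr = $dom->at('.address address');

# Text nodes that are direct children of <address>; the <br> sits between them.
my @lines = grep { length }
            map  { trim( $_->content ) }
            grep { $_->type eq 'text' } $addr->child_nodes->each;

# Assumed layout: "<street> <house number>" then "<5-digit ZIP> <city>".
# A non-matching line leaves the corresponding variables undefined.
my ( $street, $number ) = ( $lines[0] // '' ) =~ /^(.+?)\s+(\d\S*)$/;
my ( $zip,    $city )   = ( $lines[1] // '' ) =~ /^(\d{5})\s+(.+)$/;

printf "street=%s number=%s zip=%s city=%s\n",
    map { $_ // '(no match)' } $street, $number, $zip, $city;
```

Checking for undefined results and logging the file name would make it easier to spot addresses that do not follow the assumed layout.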

        Edit: Forgot to remove the "(updated)" from the node's title before it got a reply. At the time of writing this node and its reply were not actually updated.

Re: How to parse not closed HTML tags that don't have any attributes?
by Rantanplan (Novice) on Mar 06, 2021 at 22:25 UTC
    Hi,

    many thanks to all of you for your fast replies and for all this great advice!

    My HTML is indeed very "lazy", meaning that a lot of things that aren't 100 % necessary seem to have been omitted.

    There are about 10,000 *.html files containing company info, each with a few sections like the one shown, plus various other stuff in them as well. I'm going to be happy if at the end, I can manage to extract (company name + telephone number + fax number + street number + street + ZIP code + city name) into a CSV file, everything separated by commas.

    Now I'm going to test a bit, helped by the wonderful input from your side, many thanks!!

      There are about 10,000 *.html files

      As I hinted, make sure to get a representative sampling of all of this input for your test cases.

      into a CSV file

      Definitely use Text::CSV (also install Text::CSV_XS for speed).

      Now I'm going to test a bit

      Please do let us know how you get on.
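As a rough sketch of the Text::CSV suggestion above: the record below is a hypothetical stand-in for the values extracted from one HTML file (the company name and the column names are invented for illustration), and the output goes to an in-memory string so that the example is self-contained.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical records, standing in for values extracted from each HTML file.
my @records = (
    {
        name   => 'Sample Company',
        phone  => '0123-4 56 78 90',
        fax    => '',
        number => '123',
        street => 'Sample Street',
        zip    => '45678',
        city   => 'Randomcity',
    },
);
my @columns = qw(name phone fax number street zip city);

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1, eol => "\n" } );

# Written to an in-memory string here; a real run would open an output file.
open my $fh, '>', \my $out or die "Cannot open in-memory file: $!";
$csv->print( $fh, \@columns );                          # header row
$csv->print( $fh, [ @{$_}{@columns} ] ) for @records;   # one row per record
close $fh;

print $out;
```

Letting the module handle quoting means that a comma or quote character inside a company name or city will not corrupt the file, which is the main reason to prefer it over joining fields by hand.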

Re: How to parse not closed HTML tags that don't have any attributes?
by tybalt89 (Monsignor) on Mar 06, 2021 at 23:24 UTC
        #!/usr/bin/perl
        use strict;
        use warnings;

        local $_ = do { local $/; <DATA> };
        while( /<p class="title">(\w+)<\/p>\s*<p>([^<>]*)/g )
          {
          my $title = $1;
          printf "%20s %s", $title, $2 =~ s/\s*\z/\n/r;
          }

        __DATA__
        <div class="phone">
            <div class="icon"></div>
            <p class="title">Telephone</p>
            <p>0123-4 56 78 90
            <p class="title">Telefax</p>
            <p>
        </div>

    Outputs:

                   Telephone 0123-4 56 78 90
                     Telefax

    Well, it works for all the provided test cases :)

Re: How to parse not closed HTML tags that don't have any attributes?
by jcb (Parson) on Mar 06, 2021 at 21:41 UTC

    I suggest HTML::Parser and a state machine.

    You want three states:

    • idle
    • find telephone number item
    • extract telephone number

    Start in idle state and transition to "find telephone number item" when you get a start event for a P tag with class="title", then transition from that to "extract telephone number" when you get a text event containing "Telephone", otherwise return to idle state at the next text event. In "extract telephone number" state, store away the phone number at the first text event matching m/[[:digit:]]/ and return to idle state. If you only have one telephone number per page, you can also abort the parse at that point.

    See the documentation for HTML::Parser for more details about that module and any good computer science text for more details about using finite state machines as parsers.
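The three states described above could be sketched roughly as follows. The state names (idle, find_item, extract) and variable names are chosen for this example only, and the sketch does not abort the parse after the first match; it simply returns to idle.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<div class="phone"> <div class="icon"></div> '
         . '<p class="title">Telephone</p> <p>0123-4 56 78 90 '
         . '<p class="title">Telefax</p> <p> </div>';

my $state = 'idle';    # idle -> find_item -> extract -> idle
my $phone;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ( $tagname, $attr ) = @_;
            # idle -> find_item on <p class="title">
            $state = 'find_item'
                if $state eq 'idle'
                && $tagname eq 'p'
                && ( $attr->{class} // '' ) eq 'title';
        },
        'tagname, attr'
    ],
    text_h => [
        sub {
            my ($text) = @_;
            if ( $state eq 'find_item' ) {
                # The title text decides whether this is the phone item.
                $state = $text =~ /Telephone/ ? 'extract' : 'idle';
            }
            elsif ( $state eq 'extract' && $text =~ /[[:digit:]]/ ) {
                # First text containing a digit is taken as the number.
                ( $phone = $text ) =~ s/^\s+|\s+$//g;
                $state = 'idle';
            }
        },
        'dtext'
    ],
);
$parser->unbroken_text(1);    # deliver each run of text as a single event
$parser->parse($html);
$parser->eof;

print defined $phone ? "$phone\n" : "no phone number found\n";
```

An empty phone field would leave the machine in the extract state until some later text with a digit appears, so real input may need an extra transition back to idle, for example on the next <p class="title"> start event.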

Re: How to parse not closed HTML tags that don't have any attributes?
by bliako (Monsignor) on Mar 06, 2021 at 22:13 UTC

    There is method in madness: today, most html is not handcrafted but machine-made via code. In your case the code looks broken (as opposed to the html you have shown being broken - which it is). That can be to your advantage as it may be broken in a consistent way.

    Perhaps a rare case where regex can indeed parse (broken) html?

    bw, bliako

      Perhaps a rare case where regex can indeed parse (broken) html?

      No.

      Sorry for the direct answer, of course it is IMHO, but in this case my opinion happens to be fairly strong :-)

      today, most html is not handcrafted but machine-made via code.

      ... which means that browsers and other HTML parsers still have to deal with broken, hand-crafted HTML, even today. For example, HTML::Parser is now over 25 years old (it used to be known as HTML::Parse and was part of libwww-perl for a while), so it was written during the days where hand-crafted HTML was the norm, and it'll happily handle the broken HTML as well. Note that it's also the basis used in many other modules, like HTML::TreeBuilder and WWW::Mechanize.

          use warnings;
          use strict;
          use HTML::TreeBuilder::XPath;

          my $p = HTML::TreeBuilder::XPath->new;
          $p->parse(<<'HTML');
          <div class="phone">
              <div class="icon"></div>
              <p class="title">Telephone</p>
              <p>0123-4 56 78 90
              <p class="title">Telefax</p>
              <p>
          </div>
          HTML
          my %hash = map { $_->as_trimmed_text }
              $p->findnodes('//*[@class="phone"]/p');
          use Data::Dump;
          dd \%hash;
          __END__
          { Telefax => "", Telephone => "0123-4 56 78 90" }

      Of course, there may still be exceptions that even parsers can't handle. For example, say something like "<p Text</p>" - though of course a browser won't display this correctly, so even someone making a typo like this when writing HTML by hand would hopefully notice, plus, even here I would suggest first fixing the string with a regex and then sending it through an HTML parser.

Re: How to parse not closed HTML tags that don't have any attributes?
by Anonymous Monk on Mar 06, 2021 at 20:31 UTC