
Re: How to parse not closed HTML tags that don't have any attributes?

by marto (Cardinal)
on Mar 06, 2021 at 20:01 UTC (#11129209)


in reply to How to parse not closed HTML tags that don't have any attributes?

Consider this Mojo::DOM example. I've made some assumptions, as your source data does not seem complete:

cat dragnet.pl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use feature 'say';

my $html = '<div class="phone">
  <div class="icon"></div>
  <p class="title">Telephone</p>
  <p>0123-4 56 78 90</p>
  <p class="title">Telefax</p>
  <p>just the fax ma\'am</p>
</div>';

my $dom = Mojo::DOM->new( $html );

my $phone = $dom->at('div.phone > p:nth-of-type(2)')->text;
say $phone;

my $fax = $dom->at('div.phone > p:nth-of-type(4)')->text;
say $fax;

Prints:

0123-4 56 78 90
just the fax ma'am

Let us know if you have any problems or your input data is somehow weirder.

Update: Sorry, it's late in the day on a Saturday here. Since the HTML isn't valid, and I'm guessing you can't change that, try this version, which trims the whitespace left behind by the unclosed <p> tags:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(trim);
use feature 'say';

my $html = '<div class="phone">
  <div class="icon"></div>
  <p class="title">Telephone</p>
  <p>0123-4 56 78 90
  <p class="title">Telefax</p>
  <p>just the fax ma\'am
</div>';

my $dom = Mojo::DOM->new( $html );

my $phone = trim( $dom->at('div.phone > p:nth-of-type(2)')->text );
say $phone;

my $fax = trim( $dom->at('div.phone > p:nth-of-type(4)')->text );
say $fax;

Which still prints:

0123-4 56 78 90
just the fax ma'am
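
If counting paragraphs with nth-of-type feels brittle (an extra <p> anywhere would shift the positions), one possible variation is to key each value on the "title" paragraph that precedes it. This is only a sketch under the same assumptions about your markup as above. The %contact hash and the idea that every value directly follows its p.title sibling are mine, not something taken from your data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::Util qw(trim);

# Same invalid sample as above: the value paragraphs are never closed.
my $html = '<div class="phone">
  <div class="icon"></div>
  <p class="title">Telephone</p>
  <p>0123-4 56 78 90
  <p class="title">Telefax</p>
  <p>just the fax ma\'am
</div>';

my $dom = Mojo::DOM->new( $html );

# Hypothetical helper structure: label text => value text.
my %contact;
for my $title ( $dom->find('div.phone > p.title')->each ) {
    # Assumption: the value is the element immediately after the label.
    my $value = $title->next or next;
    $contact{ trim( $title->text ) } = trim( $value->text );
}

say "$_: $contact{$_}" for sort keys %contact;
```

With this approach you look values up by label, for example $contact{Telephone}, rather than by position, so it should keep working if the order of the entries changes.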
