Re: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath)

in reply to Parsing HTML/XML with Regular Expressions

In my previous comment I mentioned that I could not find a way to pass the empty_element_tags attribute through HTML::TreeBuilder to HTML::Parser. Looking at the source code of HTML::TreeBuilder, I found this:

our @ISA = qw(HTML::Element HTML::Parser);

# This looks schizoid, I know...

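So I've learnt something there! Because HTML::TreeBuilder inherits from both HTML::Element and HTML::Parser, I can call empty_element_tags(1) directly on the tree object, and now it works.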

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $file = 'example.html';
my @result;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->empty_element_tags(1); # method inherited from HTML::Parser
$tree->parse_file($file);
$tree->eof;

my @divs = $tree->findnodes('//div[@class="data"]');

for my $div (@divs) {
    my $text = $div->as_text || '';
    $text =~ s/\W//g;
    my $id = $div->attr('id') // '';  # avoid an undef warning if a div has no id
    push(@result, "$id=$text");
}

print join(', ', @result), "\n";

Output:

Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
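
For anyone wondering why the call on the tree object reaches the parser, here is a minimal sketch of method lookup through @ISA. The packages My::Element, My::Parser and My::TreeBuilder are hypothetical stand-ins, not the real modules, and the accessor is only a simplified imitation of a boolean parser option.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the first parent class: it supplies the constructor.
package My::Element;
sub new { my ($class) = @_; return bless {}, $class }

# Hypothetical stand-in for the second parent class: it supplies the option.
package My::Parser;
sub empty_element_tags {
    my ($self, @args) = @_;
    $self->{empty_element_tags} = $args[0] if @args;   # set when given a value
    return $self->{empty_element_tags};                # always return current value
}

# The child class lists both parents, in the same shape as the @ISA line quoted above.
package My::TreeBuilder;
our @ISA = qw(My::Element My::Parser);

package main;

my $tree = My::TreeBuilder->new;   # new() is found in My::Element
$tree->empty_element_tags(1);      # not in My::Element, so Perl looks in My::Parser
print "empty_element_tags is ", $tree->empty_element_tags, "\n";
```

Perl searches the classes in @ISA from left to right, so a method missing from the first parent is still found in the second. That is all that is happening when empty_element_tags(1) is called on the tree object.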

Re^2: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath) by fishy (Friar) on Oct 18, 2017 at 07:08 UTC
Great! Thanks.
