My favourite module for parsing HTML is HTML::TreeBuilder::XPath, but it misses the first div (id=Zero). It uses HTML::Parser internally, but I could not find a way to pass the necessary option, empty_element_tags => 1, from HTML::TreeBuilder through to HTML::Parser.
So here is a fairly verbose version using just HTML::Parser:
use strict;
use warnings;
use HTML::Parser;

my $file = 'example.html';
my ($in_div, $in_wanted_div) = (0, 0);   # div nesting depth; inside a class="data" div?
my @result;

my $parser = HTML::Parser->new(
    api_version        => 3,
    start_h            => [\&start, "tagname, attr"],
    text_h             => [\&text,  "dtext"],
    end_h              => [\&end,   "tagname"],
    empty_element_tags => 1,    # report <div .../> as a start event plus an end event
);
$parser->parse_file($file);
print join(', ', @result), "\n";

# Start tag: a div with class="data" opens a new "id=" entry in @result.
sub start {
    my ($tag, $attr) = @_;
    return unless $tag eq 'div';
    if (exists $attr->{'class'} and $attr->{'class'} eq 'data') {
        $in_div        = 1;
        $in_wanted_div = 1;
        push(@result, "$attr->{'id'}=");
    }
    else {
        $in_div++;
    }
}

# Text: append it, stripped of non-word characters, to the current entry.
sub text {
    my ($text) = @_;
    return unless $in_wanted_div;
    $text =~ s/\W//g;
    $result[-1] .= $text;
}

# End tag: leave the wanted div once the nesting depth is back to zero.
sub end {
    my ($tag) = @_;
    return unless $tag eq 'div';
    $in_div--;
    $in_wanted_div = 0 if not $in_div;
}
Output:
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
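To see what empty_element_tags changes, here is a minimal sketch that counts div start and end events with the option off and then on. The $html string and the %ends hash are made up for illustration (the real example.html is not reproduced here), so treat this as a rough check rather than a definitive test:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input: one self-closing div followed by one ordinary div.
my $html = '<div class="data" id="Zero"/><div class="data" id="One">Monday</div>';

my %ends;    # end-event count per option setting (illustrative helper)

for my $flag (0, 1) {
    my %count = (start => 0, end => 0);
    my $p = HTML::Parser->new(
        api_version        => 3,
        start_h            => [ sub { $count{start}++ if $_[0] eq 'div' }, 'tagname' ],
        end_h              => [ sub { $count{end}++   if $_[0] eq 'div' }, 'tagname' ],
        empty_element_tags => $flag,
    );
    $p->parse($html);
    $p->eof;    # tell the parser the input is complete
    $ends{$flag} = $count{end};
    print "empty_element_tags=$flag: $count{start} start, $count{end} end\n";
}
```

With the option on, the self-closing div should produce its own end event, which is what lets the depth counter in the script above close the Zero entry before One begins.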