You are barking up the wrong tree.
You could make that approach work, but taking data that has already been parsed (by HTML::TreeBuilder in this case), dumping it to an unparsed format (via as_HTML), and then reparsing it (via regexes) is a red flag.
Even if it were not a bad idea in general, as_HTML does not always emit the one-tag-per-line format that your code would need.
Your task is complicated by the UL and LI tags not occurring within the SPAN tag. By the time you are processing an LI tag, the author in the preceding SPAN tag cannot be reached directly from it, because the SPAN comes before the LI but is not a parent of it.
Your impulse to iterate over the tags is good. The "my $author;" line would have to be outside the while() loop, though.
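To illustrate the scoping point on its own, here is a minimal sketch that uses plain strings instead of parsed tags. The @lines data and the "author:"/"book:" prefixes are made up for this illustration, and it uses a for loop rather than while(), but the rule is the same: the variable that remembers the current author has to be declared before the loop so that its value survives from one iteration to the next.

```perl
use strict;
use warnings;

# Hypothetical input: an author marker followed by that author's books.
my @lines = ( 'author: A', 'book: one', 'book: two', 'author: B', 'book: three' );

my $author;    # declared OUTSIDE the loop, so it persists across iterations
my @pairs;
for my $line (@lines) {
    if ( $line =~ /^author: (.*)/ ) {
        $author = $1;                     # remember the most recent author
    }
    elsif ( $line =~ /^book: (.*)/ ) {
        push @pairs, "$1 => $author";     # pair the book with that author
    }
}
print "$_\n" for @pairs;
```

Had "my $author;" been placed inside the loop body, it would be a fresh, undefined variable on every pass, and no book line would ever see an author.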
find_by_tag_name() accepts multiple tag names, so a single call can collect the SPAN and LI elements together in document order, which is what you need.
Working, tested code:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dumper; $Data::Dumper::Sortkeys = 1;
my $tree = HTML::TreeBuilder->new;
$tree->parse( <<'END_OF_HTML' );
<span> Author_name </span>
__filler__
<ul>
<li> book 1 by Author_name </li>
<li> book 2 by Author_name </li>
</ul>
<span> New_Author </span>
__filler__
<ul>
<li> book 1 by new </li>
</ul>
END_OF_HTML
$tree->eof;
# Uncomment to show that as_HTML is a bad fit for this task.
# open my $fh , '<', \( $tree->as_HTML('', ' ') ) or die;
# print $_ while <$fh>;
# exit;
my @tags = $tree->find_by_tag_name( qw( span li ) );
my $current_author;
my %book_author;
my %author_books_HoA;
for my $t (@tags) {
    my $tag_name = $t->tag;
    if ( $tag_name eq 'span' ) {
        $current_author = $t->as_trimmed_text;
    }
    elsif ( $tag_name eq 'li' ) {
        next unless $t->parent->tag eq 'ul';
        my $book_title = $t->as_trimmed_text;
        warn if exists $book_author{$book_title};
        $book_author{$book_title} = $current_author;
        push @{ $author_books_HoA{$current_author} }, $book_title;
    }
    else {
        die "Unexpected tag $tag_name";
    }
}
print Dumper \%book_author, \%author_books_HoA;
Output:
$VAR1 = {
          'book 1 by Author_name' => 'Author_name',
          'book 1 by new' => 'New_Author',
          'book 2 by Author_name' => 'Author_name'
        };
$VAR2 = {
          'Author_name' => [
                             'book 1 by Author_name',
                             'book 2 by Author_name'
                           ],
          'New_Author' => [
                            'book 1 by new'
                          ]
        };
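If the end goal is a per-author listing rather than a dump, the hash of arrays can be walked as in the following sketch. The hash literal simply mirrors the dumped $VAR2 so that the snippet runs on its own; the @report array and the output format are illustrative choices, not part of the original script.

```perl
use strict;
use warnings;

# Same shape as %author_books_HoA in the script; values copied from the dump.
my %author_books_HoA = (
    'Author_name' => [ 'book 1 by Author_name', 'book 2 by Author_name' ],
    'New_Author'  => [ 'book 1 by new' ],
);

my @report;
for my $author ( sort keys %author_books_HoA ) {
    my @books = @{ $author_books_HoA{$author} };    # dereference the array ref
    push @report, sprintf '%s (%d): %s', $author, scalar @books, join '; ', @books;
}
print "$_\n" for @report;
```

Sorting the keys keeps the listing stable between runs, since Perl does not guarantee hash key order.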