Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Passing a list as a subroutine's return value ( list of HTML::Element objects )

by mldvx4 (Friar)
on Sep 27, 2023 at 08:28 UTC ( [id://11154682] : perlquestion . print w/replies, xml ) Need Help??

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:


I would like to pass a list of HTML::Element objects as a return value from a subroutine. I thought since the list would be just a list, that the normal approach would apply. Clearly I am misunderstanding something, or perhaps there is a more appropriate approach:

#!/usr/bin/perl use HTML::TreeBuilder::XPath; use HTML::Element; use HTML::Entities qw(decode_entities); use Data::Dumper; use strict; use warnings; my $html = &layer(3); print $html->as_XML,"\n"; exit(0); sub layer { my ($layer) = (@_); my $ul = HTML::Element->new('ul'); my $li = HTML::Element->new('li'); $li->push_content("Layer $layer"); $ul->push_content($li); if($layer--) { my $h = &layer($layer); my @c = &unescape_entities($h); # offending line print "Wrong structure:\n",Dumper(@c),"\n ----\n"; exit(1); $ul->push_content($h); } else { my $foo = ' foo < bar'; my $literal = HTML::Element->new('~literal', text=>$foo); $li->push_content($literal); $ul->push_content($li); } return($ul); } sub unescape_entities { my ($html) = (@_); my $tmp = HTML::TreeBuilder::XPath->new; $tmp->parse(decode_entities($html->as_XML)); my @c = $tmp->findnodes('//body/*'); print "Right structure:\n", Dumper(@c),"\n ----\n"; $tmp->delete; return(@c); # this is getting transformed }

I would expect that the script (minus the exit) to produce a nested HTML unordered list. What happens is that the variable actually recovered from the subroutine is basically empty, as seen by comparing the output from the two Dump calls.

I've looked at the manual pages for the modules given above as well as the one for perlrref. Please nudge me in the right direction.

Replies are listed 'Best First'.
Re: Passing a list as a subroutine's return value ( list of HTML::Element objects )
by tobyink (Canon) on Sep 27, 2023 at 09:14 UTC

      Not quite that simple, cause you now have a memory leak. The entire tree is never freed.

      Is it enough? Declaring $tmp outside the unescape_entitites sub makes the output even larger and identical to the expected one. So it seems you need to have the HTML::TreeBuilder::XPath object around to be able to access the nodes it matched.

      Update: it seems the difference is the _parent contents which just contains the XPath internals. Disregard this node.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]


      How can one tell when a 'new' variable is somehow connected to another?

        In general, you don't have to worry about that. As long as you hold a reference, the object will be kept alive. But you called a function which frees the tree including all of its nodes, which it does by hollowing out any existing references to its nodes.

Re: Passing a list as a subroutine's return value ( list of HTML::Element objects )
by ikegami (Patriarch) on Sep 27, 2023 at 14:10 UTC

    You are freeing the tree including all of its nodes before you're done with the nodes.

    But it turns out you don't need to return the nodes. You just want the stringification of the elements, so stringify them in the sub instead of in the caller.

    sub unescape_entities { my $html = shift; my $tree = HTML::TreeBuilder::XPath->new; $tree->parse( decode_entities( $html->as_XML ) ); my @c = map "$_", $tree->findnodes( '//body/*' ); $tree->delete; return @c; }

    Better yet (since it properly handles exceptions):

    use Sub::ScopeFinalizer qw( scope_finalizer ); sub unescape_entities { my $html = shift; my $tree = HTML::TreeBuilder::XPath->new; my $anchor = scope_finalizer { $tree->delete; }; $tree->parse( decode_entities( $html->as_XML ) ); return map "$_", $tmp->findnodes( '//body/*' ); }

    Note: Does passing XML to HTML::TreeBuilder::XPath make sense???

      That works well. It also led to a third way. As a result, to put it into a larger context, I now have a working script which can access MySQL tables from a defunct WordPress instance and produce static HTML in a file system hierarchy matching the structure in the original WordPress URLs. That's not generic enough to package, but might get shown to the public later this year.

      Going slightly off-topic, one hard part was that WordPress comments are nested and SQL is not as suited for that as a regular key-value database. So a recursive query was needed. I had to cargo cult that, but creating a matching Perl routine to process the recursive query (once I had the query) was rather easy.

      . . . my $query = qq( with recursive cte (comment_ID, comment_post_ID, comment_author +, comment_parent, comment_date_gmt, comment_type, comment_con +tent) as ( select comment_ID, comment_post_ID, comment_author, comment_parent, comment_date_gmt, comment_type, comment_content from wp_comments where comment_ID = ? AND comment_approved = 1 union all select p.comment_ID, p.comment_post_ID, p.comment_author, p.comment_parent, p.comment_date_gmt, p.comment_type, p.comment_content from wp_comments p inner join cte on p.comment_parent = cte.comment_ID ) SELECT * FROM cte ORDER BY comment_date_gmt; ); my $sth = $dbh->prepare($query); $sth->execute($id); while(my $row = $sth->fetchrow_hashref) { my $cid = $row->{comment_ID}; my $parent_id = $row->{comment_parent}; my $post_id = $row->{comment_post_ID}; if ($parent_id eq 0) { push(@{$posts{$post_id}}, $cid); } $comments{$cid}->{comment_post_ID} = $row->{comment_post_ID}; $comments{$cid}->{comment_parent} = $row->{comment_parent}; $comments{$cid}->{comment_author} = $row->{comment_author}; $comments{$cid}->{comment_date_gmt} = $row->{comment_date_gmt} +; my $content = $row->{comment_content}; $content =~ s|(\s*)\n(\s*)\n|$1<br />\n$2<br />\n|gm; $comments{$cid}->{comment_content} = $content; push (@{$hierarchy{$parent_id}}, $cid); } $sth->finish(); . . .

      The matching subroutine has to be recursive:

      sub render { my ($layer, $k, $comments, $hierarchy) = (@_); my $comment = $hierarchy{$k}; if (!defined($comment)){ return(0); } $layer++; my $ul = HTML::Element->new('ul'); my $count = 0; foreach my $c (@{$comment}) { my $li = HTML::Element->new('li'); $li->attr('id', "comment$c"); my $p1 = HTML::Element->new('p'); $p1->attr('class','author'); $p1->push_content($comments{$c}->{comment_author} ); $li->push_content($p1); my $p2 = HTML::Element->new('p'); $p2->attr('class','date'); $p2->push_content($comments{$c}->{comment_date_gmt}); $li->push_content($p2); my $div = HTML::Element->new('div'); $div->attr('class','words'); $div->push_content($comments{$c}->{comment_content}); $li->push_content($div); $ul->push_content($li); print "."x$layer,$c,"\n" if ($VERBOSE); my $html = &render($layer, $c, $comments, $hierarchy); if ($html) { $ul->push_content($html); $count++; } } return($ul); }

      The stringification now happens higher up in the work flow.

      Thanks, and thanks to all who read and or replied.