PerlMonks  

Replacing an HTML element with multiple elements using HTML::TreeBuilder

by mldvx4 (Friar)
on Jul 03, 2019 at 08:22 UTC ( [id://11102331]=perlquestion )

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to replace selected paragraphs in place but can't figure out the right way of doing so. Specifically, I'd like to replace a single P with multiple P elements at the same point in the tree. I've tried dozens of variations of the code below, but what I have gives an error: "the target node's parent has no content!?"

I don't understand. The replace_with_content or push_content methods should have established something for postinsert to add to. Clearly I have missed something? What though?

    #!/usr/bin/perl

    use HTML::TreeBuilder::XPath;
    use warnings;
    use strict;

    &readfile;

    exit(0);

    sub readfile {
        my ($file)= (@_);

        my $xhtml = HTML::TreeBuilder::XPath->new;
        $xhtml->implicit_tags(1);
        $xhtml->no_space_compacting(1);

        $xhtml->parse_file(\*DATA)
            or die();

        # find double-spaced paragraphs inside blockquotes and expand them
        for my $p ($xhtml->findnodes('//blockquote/p')) {
            my $text = $p->as_text();
            $text =~ s/^\s+//;
            $text =~ s/\s+$//;
            next unless($text =~/\n\s*\n\s*/);

            my @paragraphs = split(/\s*\n\s*/, $text);
            print qq(\t\@paragraphs=),join(',',@paragraphs),qq(\n);
            if ($#paragraphs >= 0) {
                my $pp = shift(@paragraphs);
                print qq(\t\tpp1=$pp\n);
                $p->replace_with_content();
                $p->push_content(['p',,$pp]);
                print qq(Identified :\n);
                print qq(«),$p->as_XML_indented,qq(»\n);

                foreach $pp (@paragraphs) {
                    print qq(\t\tpp2=$pp\n);
                    $p->postinsert(['p',,$pp]);
                }
            }
        }

        print qq(\n),qq(-)x30,qq(\n);
        my ($body) = $xhtml->findnodes('//body');
        print qq(\n);
        print $body->as_XML_indented;

        $xhtml->delete;
        return (1);
    }

    __DATA__
    <body>
    <blockquote id="one">
    aaa

    bbb

    ccc
    </blockquote>
    <blockquote id="two">
    <p>
    ddd

    eee

    fff
    </p>
    </blockquote>
    <blockquote id="three">
    <p>
    ggg
    </p>
    <p>
    hhh
    </p>
    <p>
    iii
    </p>
    </blockquote>
    <blockquote id="four">
    <p>
    jjj
    </p>
    </blockquote>
    </body>

The expected output would be for BLOCKQUOTE number two to contain three separate paragraphs instead of one (or four). The other P in the other BLOCKQUOTE elements should continue to be left alone, as the script currently does.


Replies are listed 'Best First'.
Re: Replacing an HTML element with multiple elements using HTML::TreeBuilder
by choroba (Cardinal) on Jul 03, 2019 at 14:20 UTC
    Here's how to do your task in XML::XSH2, a wrapper around XML::LibXML I happen to maintain:

        open file.xml ;
        for my $b in //blockquote[count(p)=1] {
            my $texts = xsh:split("\n\n", $b/p) ;
            if (count($texts) > 1) {
                for my $text in $texts {
                    my $p := insert element p append $b ;
                    insert text normalize-space($text) into $p ;
                }
                rm $b/p[1] ;
            }
        }
        save :b ;

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Replacing an HTML element with multiple elements using HTML::TreeBuilder
by tangent (Parson) on Jul 03, 2019 at 16:08 UTC
    Not sure if I understand your requirements fully, but this works for me:
        for my $p ($xhtml->findnodes('//blockquote/p')) {
            my $text = $p->as_text();
            $text =~ s/^\s+//;
            $text =~ s/\s+$//;
            if ( $text =~/\n\s*\n\s*/ ) {
                my @paragraphs = split(/\s*\n\s*/, $text);
                my @new_elems;
                for my $para (@paragraphs) {
                    my $new = HTML::Element->new('p');
                    $new->push_content($para);
                    push(@new_elems, $new);
                }
                $p->replace_with(@new_elems);
            }
        }

      Thanks. I had tried several approaches using replace_with, but each met with various kinds of failure. Your example works and gives me a bit more of an idea how to use the methods.
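
A note for later readers, offered as a sketch rather than as the thread's definitive answer. The HTML::Element documentation says that replace_with_content returns the element with no parent and no content of its own, so the later postinsert call in the question runs on a detached node, which is likely what triggers the error. The script below wraps the replace_with approach from the reply above in a complete program. Assumptions: it uses plain HTML::TreeBuilder with look_down instead of the XPath subclass, the input is a trimmed copy of the question's data, and the helper names build_tree and expand_paragraphs are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML string while preserving newlines in
# text nodes, so blank-line-separated chunks can still be detected.
sub build_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->no_space_compacting(1);
    $tree->parse($html);
    $tree->eof;
    return $tree;
}

# Hypothetical helper: replace each double-spaced P that sits directly
# inside a BLOCKQUOTE with one P per chunk. Returns how many P elements
# were expanded.
sub expand_paragraphs {
    my ($tree) = @_;
    my $expanded = 0;

    # Collect the candidates first, then modify the tree.
    my @candidates = $tree->look_down(
        _tag => 'p',
        sub {
            my $parent = $_[0]->parent;
            defined $parent && $parent->tag eq 'blockquote';
        },
    );

    for my $p (@candidates) {
        my $text = $p->as_text;
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
        next unless $text =~ /\n\s*\n/;

        my @new_elems = map {
            my $new = HTML::Element->new('p');
            $new->push_content($_);
            $new;
        } split /\s*\n\s*/, $text;

        # $p is still attached to its parent at this point, which is
        # what replace_with needs in order to splice in the new nodes.
        $p->replace_with(@new_elems);
        $p->delete;    # discard the old, now detached, element
        $expanded++;
    }
    return $expanded;
}

my $tree = build_tree(<<'HTML');
<body>
<blockquote id="two">
<p>
ddd

eee

fff
</p>
</blockquote>
<blockquote id="four">
<p>
jjj
</p>
</blockquote>
</body>
HTML

expand_paragraphs($tree);
my $two = $tree->look_down(id => 'two');
print join(',', map { $_->as_text } $two->look_down(_tag => 'p')), "\n";
$tree->delete;
```

The key design point is that all new elements are built before the old P is touched, and a single replace_with call swaps them in while the old P still has a parent.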

Re: Replacing an HTML element with multiple elements using HTML::TreeBuilder
by marto (Cardinal) on Jul 03, 2019 at 09:23 UTC

    Thanks for the example. It would be easier to understand what you want if you showed us the HTML output you require, based on the input in your example.

      I'm aiming for some output like the following, not counting whitespace:

          <blockquote id="two">
              <p>ddd</p>
              <p>eee</p>
              <p>fff</p>
          </blockquote>

      The one P would be replaced with three new P elements.

Re: Replacing an HTML element with multiple elements using HTML::TreeBuilder
by daxim (Curate) on Jul 03, 2019 at 12:49 UTC
    Treebuilder API is cancer. Use something better.
        use Web::Query::LibXML qw();
        my $w = Web::Query->new_from_html(<<'HTML');
        <body>
        …
        HTML
        $w->find('blockquote p')->each(sub {
            my @parts = split ' ', $_->text;
            if (@parts > 1) {
                for my $p (reverse @parts) {
                    $_->after("<p>$p</p>");
                }
                $_->remove;
            }
        });
        print $w->as_html;
        __END__
        …
        <blockquote id="two"><p>ddd</p><p>eee</p><p>fff</p></blockquote>
        …

      daxim: Treebuilder API is cancer. Use something better.

      No it isn't. It is neither damaging nor does it spread.

      Sugar is nice, unless it rots your attitude :P

Re: Replacing an HTML element with multiple elements using HTML::TreeBuilder
by skleblan (Sexton) on Jul 03, 2019 at 14:19 UTC
    I'm still working on understanding your code, but doesn't \s include newline and carriage return characters?

      Yes, \s covers newlines as well as other whitespace. That part of the code works for my needs. The part that I do not know how to do is getting the whole original P element replaced by multiple new P elements. Perhaps I should write a more focused example.
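
To make the \s point concrete, here is a small module-free sketch of the trimming and splitting steps from the question. The helper name split_paragraphs is invented for illustration. It shows that the split pattern swallows a whole run of newlines and surrounding whitespace as one separator, and that the pattern alone would also split on a single newline, which is why the question's code first checks for a blank line.

```perl
use strict;
use warnings;

# Hypothetical helper: trim the text, then split on any newline together
# with the whitespace around it, as the code in the question does.
sub split_paragraphs {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return split /\s*\n\s*/, $text;
}

# Double-spaced text: each blank-line run counts as a single separator.
my @double = split_paragraphs("\nddd\n\neee\n\nfff\n");
print scalar(@double), " parts: @double\n";

# Single-spaced text is split too, so a separate test for a blank line
# is still needed before deciding to expand a paragraph.
my $single = "ggg\nhhh";
my @single = split_paragraphs($single);
print scalar(@single), " parts: @single\n";
print "blank line present: ", ($single =~ /\n\s*\n/ ? "yes" : "no"), "\n";
```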
