note
haukex
<p>Obligatory Link to [id://11116478]...</p>
<p>Based on your function I'm presuming you want to preserve tags - if you didn't, then the task would be easily accomplished with something like [mod://HTML::Strip].</p>
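<p>For comparison, a minimal sketch of the tag-discarding approach, assuming [mod://HTML::Strip] is installed. The input string and the <c>$limit</c> value are made up for illustration, and the plain <c>substr</c> cut can split a word:</p>
<c>
use strict;
use warnings;
use HTML::Strip;

# Hypothetical input and limit, for illustration only.
my $html  = '<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>';
my $limit = 30;

my $hs   = HTML::Strip->new;
my $text = $hs->parse($html);   # text with the markup removed
$hs->eof;                       # reset the parser state before reuse

# Plain character cut; this can split a word in the middle.
my $abstract = length($text) > $limit
    ? substr($text, 0, $limit) . '...'
    : $text;
print "$abstract\n";
</c>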
<p>You haven't provided any sample input, so I had to make some up; I hope it's representative. Note that it already exposes some flaws when I run it through your function: <c>/(.*)<(.*)/</c> needs an <c>/s</c> flag, and the <c><p></c> and <c><i></c> tags are not closed properly. I could also easily break it completely with some of the tricks in the above link.</p>
<p>Doing the task "right" is unfortunately not exactly trivial, even with some of the nice HTML parsers. Here's my attempt, which I haven't tested thoroughly. It was a nice exercise because I hadn't really used [mod://Mojo::DOM] for DOM creation before. Note that it counts only characters of text toward the limit, not the HTML tags.</p>
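<p>To illustrate the <c>/s</c> point in isolation: without that flag, <c>.</c> does not match a newline, so the first capture cannot reach back across line breaks. A small sketch with a made-up two-line string:</p>
<c>
use strict;
use warnings;

# Hypothetical input: the only "<" is on the second line.
my $html = "line one\nline <b>two";

# Without /s, "." stops at "\n", so $1 only covers part of line two.
my ($no_s)   = $html =~ /(.*)<(.*)/;

# With /s, "." also matches "\n", so $1 spans both lines.
my ($with_s) = $html =~ /(.*)<(.*)/s;

printf "without /s: %d chars captured\n", length $no_s;    # 5  ("line ")
printf "with /s:    %d chars captured\n", length $with_s;  # 14 ("line one\nline ")
</c>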
<c>
use warnings;
use strict;
print html_abstract(<<'END_HTML', 200), "\n";
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed
tristique purus urna, a lacinia nulla euismod et. Pellentesque tempus
et justo faucibus. <i>Fusce scelerisque, <b>magna</b> <a
href="http://www.example.com">efficitur congue, leo nibh</a>
volutpat nibh, ac mattis dolor ipsum sit amet quam.</i> Suspendisse
eleifend id ligula quis placerat. Pellentesque fermentum eu magna sed
mollis. Quisque placerat efficitur blandit. Vestibulum non.</p>
END_HTML
use Mojo::DOM;
sub html_abstract {
    my ($html, $remain) = @_;   # $remain: text characters still allowed
    my $walk; $walk = sub {     # recursively copy nodes from $in to $out
        my ($in, $out) = @_;
        for my $n ( @{ $in->child_nodes } ) {
            last unless $remain;
            if ( $n->type eq 'cdata' || $n->type eq 'text' ) {
                my $txt = $n->content;
                if ( length $txt < $remain ) {
                    $out->append_content($txt);
                    $remain -= length $txt;
                }
                else {
                    # cut at the last word boundary within the limit;
                    # $cut is undef if the text has no word boundary
                    my ($cut) = $txt =~ /^(.{0,$remain}\b)/s;
                    $out->append_content( ( $cut // '' ) . '...' );
                    $remain = 0;
                }
            }
            elsif ( $n->type eq 'tag' ) {
                my $t = $out->new_tag( $n->tag, %{ $n->attr } )
                    # new_tag gives us a "root", but we want the tag
                    ->child_nodes->first;
                $walk->($n, $t);
                $out->append_content($t);
            } # ignore other node types for now
        }
        return $out;
    };
    return $walk->(Mojo::DOM->new($html), Mojo::DOM->new)->to_string;
}
__END__
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed
tristique purus urna, a lacinia nulla euismod et. Pellentesque tempus
et justo faucibus. <i>Fusce scelerisque, <b>magna</b> <a href="http://www.example.com">efficitur congue, leo ...</a></i></p>
</c>
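<p>The word-boundary cut used for the last text node can be tried on its own, without any HTML parsing. This is only a sketch: <c>truncate_words</c> is a hypothetical helper name, and the fallback for a failed match (for example, whitespace-only text) is my assumption about what one would want there:</p>
<c>
use strict;
use warnings;

# Hypothetical helper: shorten plain text to at most $max characters,
# ending on a word boundary, and append "...".
sub truncate_words {
    my ($txt, $max) = @_;
    return $txt if length($txt) < $max;
    # Greedily take up to $max characters, then backtrack to a \b.
    my ($head) = $txt =~ /^(.{0,$max}\b)/s;
    $head = '' unless defined $head;   # no word boundary found at all
    return "$head...";
}

print truncate_words("Lorem ipsum dolor sit amet", 14), "\n";
# prints "Lorem ipsum ..."
</c>
<p>The cut lands before "dolor" because 14 characters would end in the middle of that word, so the regex backtracks to the boundary after the preceding space.</p>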
<p><b>Update:</b> The above can also be extended to filter certain tags by adding the following branch before the <c>elsif ( $n->type eq 'tag' )</c>. Here <c>%filter</c> is a hash whose keys are the names of the tags to remove; the tags themselves are dropped, but their contents are still copied into the parent. The condition can also be reversed to keep only the listed tags:</p>
<c>
elsif ( $n->type eq 'tag' && $filter{$n->tag} )
{ $walk->($n, $out) }
</c>
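<p>A sketch of how such a hash might be set up. The tag lists, the <c>%keep</c> hash, and the two helper subs are hypothetical and only mirror the <c>$filter{$n->tag}</c> test on plain tag names:</p>
<c>
use strict;
use warnings;

# Hypothetical list of tags to remove; the values only need to be true.
my %filter = map { $_ => 1 } qw( a b );

# Mirrors the test in the extra branch: true means "drop this tag".
sub drop_tag { my ($tag) = @_; return $filter{$tag} ? 1 : 0 }

# Reversed variant: drop every tag that is NOT in the list.
my %keep = map { $_ => 1 } qw( p i );
sub drop_unless_kept { my ($tag) = @_; return $keep{$tag} ? 0 : 1 }

for my $tag (qw( p i b a )) {
    printf "%-2s filter: %d  keep-only: %d\n",
        $tag, drop_tag($tag), drop_unless_kept($tag);
}
</c>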
11135746
11135746