Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Obligatory Link to Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks...

Based on your function I'm presuming you want to preserve tags - if you didn't, then the task would be easily accomplished with something like HTML::Strip.

You haven't provided any sample input, so I had to make some up, I hope it's representative - but note that it already demonstrates some flaws if I run it through your function: /(.*)<(.*)/ needs an /s flag, and the <p> and <i> tags are not closed properly. I could also easily break it completely with some of the tricks in the above link.

Doing the task "right" is unfortunately not exactly trivial even with some of the nice HTML parsers. Here's my attempt, which I haven't fully put through its paces in terms of testing. It was a nice exercise because I actually haven't really used Mojo::DOM for DOM creation yet. Note how it counts characters of text only, not including the HTML tags.

use warnings; use strict; print html_abstract(<<'END_HTML', 200), "\n"; <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique purus urna, a lacinia nulla euismod et. Pellentesque tempus et justo faucibus. <i>Fusce scelerisque, <b>magna</b> <a href="http://www.example.com">efficitur congue, leo nibh</a> volutpat nibh, ac mattis dolor ipsum sit amet quam.</i> Suspendisse eleifend id ligula quis placerat. Pellentesque fermentum eu magna sed mollis. Quisque placerat efficitur blandit. Vestibulum non.</p> END_HTML use Mojo::DOM; sub html_abstract { my ($html, $remain) = @_; my $walk; $walk = sub { my ($in, $out) = @_; for my $n ( @{ $in->child_nodes } ) { last unless $remain; if ( $n->type eq 'cdata' || $n->type eq 'text' ) { my $txt = $n->content; if ( length $txt < $remain ) { $out->append_content($txt); $remain -= length $txt; } else { $txt =~ /^(.{0,$remain}\b)/s; $out->append_content("$1..."); $remain = 0; } } elsif ( $n->type eq 'tag' ) { my $t = $out->new_tag( $n->tag, %{ $n->attr } ) # new_tag gives us a "root", but we want the tag ->child_nodes->first; $walk->($n, $t); $out->append_content($t); } # ignore other node types for now } return $out; }; return $walk->(Mojo::DOM->new($html), Mojo::DOM->new)->to_string; } __END__ <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique purus urna, a lacinia nulla euismod et. Pellentesque tempus et justo faucibus. <i>Fusce scelerisque, <b>magna</b> <a href="http:// +www.example.com">efficitur congue, leo ...</a></i></p>

Update: The above can also be extended to filter certain tags by adding this before the elsif ( $n->type eq 'tag' ), where %filter is a hash with the keys being names of tags to remove (or the condition can be reversed to keep only those tags):

elsif ( $n->type eq 'tag' && $filter{$n->tag} ) { $walk->($n, $out) }

In reply to Re: Creating an abstract (updated) by haukex
in thread Creating an abstract by Bod

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2022-05-29 03:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (101 votes). Check out past polls.

    Notices?