PerlMonks  

Re^6: Looking for a module that strips an HTML tag and its associated 'TEXT'

by nysus (Parson)
on Jul 29, 2020 at 14:10 UTC


in reply to Re^5: Looking for a module that strips an HTML tag and its associated 'TEXT'
in thread Looking for a module that strips an HTML tag and its associated 'TEXT'

I want the code to pass these simple tests:

# eliminate_tags
is(eliminate_tags("url: <a href=\"http://example.com/\">http://example.com/</a>", 'a'), "url: ");
is(eliminate_tags("<div>\n <p>hoge foo.</p>\n <p>bar tarao.</p>\n</div>", 'p'), "<div>\n \n \n</div>");

# eliminate_links
is(eliminate_links("url: <a href=\"http://example.com/\">http://example.com/</a>"), "url: ");
is(eliminate_links("<div>\n <p>hoge foo.</p>\n <p>bar tarao.</p>\n</div>"), "<div>\n <p>hoge foo.</p>\n <p>bar tarao.</p>\n</div>");

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;
Click here if you love Perl Monks

Re^7: Looking for a module that strips an HTML tag and its associated 'TEXT'
by bliako (Monsignor) on Jul 29, 2020 at 14:29 UTC

    This looks like a very simple DOM manipulation: deleting nodes from the DOM. I can do that in Firefox's developer tools, and I am sure any DOM manipulator can do it too. Specifically, the Mojo::DOM suggested by marto should also be able to do it, though I have not used it before. In short: parse your HTML into a DOM, which is a tree of HTML-tag nodes. Locate the node by XPath or some other exotic selector. Zap the node and/or its children. Work at as high a level as you can with this one, because the spec will keep changing and it will come back to bite you.
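The parse, locate, zap steps described above can be sketched with Mojo::DOM roughly as follows. This is only a sketch under assumptions: the function names `eliminate_tags` and `eliminate_links` are taken from the tests earlier in the thread, the bodies are one guess at what is wanted, and Mojo::DOM locates nodes with CSS selectors rather than XPath. The serialized output is not guaranteed to match the input byte for byte on arbitrary HTML.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: remove every element matching $tag (a CSS selector),
# together with its children and text, and return the remaining markup.
sub eliminate_tags {
    my ($html, $tag) = @_;
    my $dom = Mojo::DOM->new($html);           # parse HTML into a DOM tree
    $dom->find($tag)->each(sub { $_->remove }); # locate and zap each match
    return $dom->to_string;                     # serialize what is left
}

# Hypothetical helper: links are just <a> elements.
sub eliminate_links {
    my ($html) = @_;
    return eliminate_tags($html, 'a');
}
```

Keeping the tag name as a parameter means the same few lines cover both test groups; `eliminate_links` is just a thin wrapper.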

    Edit: I am not sure whether the HTML -> DOM -> manipulate -> HTML round trip will retain the whitespace between tags exactly as in the original HTML, which you seem to want given the test cases you provided.

    Add: a regex "is simpler" but it isn't.
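To illustrate why the regex route is less simple than it looks, here is a small sketch using an input invented for this example (nested `<div>` elements, which are not in the original tests). A non-greedy match stops at the first closing tag it finds, so it cuts the outer element in half.

```perl
use strict;
use warnings;

# Made-up input with one <div> nested inside another.
my $html = '<div>a<div>b</div>c</div>';

# Naive attempt: delete each <div>...</div> pair with a non-greedy match.
(my $stripped = $html) =~ s{<div\b[^>]*>.*?</div>}{}gs;

# The match ends at the inner </div>, leaving a dangling tail behind.
print "$stripped\n";   # prints "c</div>"
```

A DOM parser tracks nesting, so removing the outer element takes the inner one with it.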

    bw, bliako

Re^7: Looking for a module that strips an HTML tag and its associated 'TEXT'
by marto (Cardinal) on Jul 29, 2020 at 14:29 UTC

    And did you look at my earlier suggestion?

      I looked at Mojo::DOM briefly. It's a general-purpose tool. I was hoping for a module that would let me knock this out in about two lines. Mojo::DOM is my fallback plan if I can't find what I'm looking for.


        "Was hoping for a module that let me knock this out in in like two lines."

        Ignoring the many dependent modules and literally thousands of lines of code :P

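        use Mojo::Base -strict;   # enables strict, warnings and say
        use Mojo::DOM;

        my $html = 'url: <a href="http://example.com">http://example.com</a>';
        my $dom  = Mojo::DOM->new( $html );
        say $dom->at('a')->remove;   # remove returns the parent, so this prints what is left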
