PerlMonks  

Strip HTML tags again

by dda (Friar)
on Jun 30, 2002 at 15:37 UTC ( #178374=perlquestion )

dda has asked for the wisdom of the Perl Monks concerning the following question:

Hello All!

Can you give me a simple example of how to filter out all HTML tags in a single line of text, but 'non-HTML' tags should not be filtered? For example:

Should be filtered, the result will be "text1":

<a href="mylink">text1</a>

Should not be filtered, the result will be "<this is a normal text>":

<this is a normal text>

Your help is greatly appreciated.
--dda

Replies are listed 'Best First'.
Re: Strip HTML tags again
by Ovid (Cardinal) on Jun 30, 2002 at 20:20 UTC

    This problem looks tailor-made for my HTML::TokeParser::Simple module, when combined with HTML::Tagset. The following test will demonstrate:

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple;
    use HTML::Tagset;

    my $html = <<'END_HTML';
    <a href="mylink">text1</a>
    <this is normal text>
    END_HTML

    my $p = HTML::TokeParser::Simple->new( \$html );

    while ( my $token = $p->get_token ) {
        next if ! $token->is_text and exists $HTML::Tagset::isKnown{ $token->return_tag };
        print $token->return_text;
    }

    Result:

    text1
    <this is normal text>

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      ++ for the original. I'm posting an updated example because some changes to the module seem to have borked your example. This is an in place stripper--based on the one you posted--with the newer/working syntax.

      sub strip_html {
          my $renew = "";
          my $p = HTML::TokeParser::Simple->new(\$_[0]);
          no warnings "uninitialized";
          while ( my $token = $p->get_token ) {
              next if ! $token->is_text and exists $HTML::Tagset::isKnown{ $token->get_tag };
              $renew .= $token->as_is;
          }
          $_[0] = $renew;
      }

Re: Strip HTML tags again
by merlyn (Sage) on Jun 30, 2002 at 15:54 UTC
    Here's an example from the eg directory in the HTML::Parser distribution:
    #!/usr/bin/perl -w

    # Extract all plain text from an HTML file

    use strict;
    use HTML::Parser 3.00 ();

    my %inside;

    sub tag {
        my($tag, $num) = @_;
        $inside{$tag} += $num;
        print " ";  # not for all tags
    }

    sub text {
        return if $inside{script} || $inside{style};
        print $_[0];
    }

    HTML::Parser->new(api_version => 3,
                      handlers    => [start => [\&tag, "tagname, '+1'"],
                                      end   => [\&tag, "tagname, '-1'"],
                                      text  => [\&text, "dtext"],
                                     ],
                      marked_sections => 1,
        )->parse_file(shift) || die "Can't open file: $!\n";

    -- Randal L. Schwartz, Perl hacker

Re: Stripping HTML tags from a document
by cjf (Parson) on Jun 30, 2002 at 15:55 UTC

    Have a look at HTML::Tagset; it contains various lists of valid HTML tags for different sections of a document.

    Update: ++ to Ovid for providing the working example below.

      Thanks!!! It is the stuff I was looking for. Now I'd like to know how to use it in a 'perl' manner. Currently I have the following code (right from perlfaq):
      sub strip_html {
          my $t = shift;
          $t =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
          return $t;
      }
      Seems like I have to use %HTML::Tagset::isKnown hash, but how to apply it to my sub? I can't find any quick way...

      --dda
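One way the regex approach and the %HTML::Tagset::isKnown hash might be combined is sketched below. It is only a rough illustration, not the method of any poster in this thread: the sub name strip_known_tags is made up for the example, and the pattern assumes that a tag starts with "<" (optionally "</") followed directly by an element name. It does not cope with ">" inside quoted attribute values, which is one reason the parser-based replies are more robust.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Tagset;   # provides %HTML::Tagset::isKnown

# Hypothetical helper: remove a <...> run only when the word right after
# "<" or "</" is a known HTML element name; leave everything else alone.
sub strip_known_tags {
    my ($text) = @_;
    $text =~ s{
        (                            # capture 1: the whole candidate tag
          < /?                       # opening bracket, optional slash
          ( [A-Za-z][A-Za-z0-9]* )   # capture 2: candidate element name
          \b [^>]*                   # rest of the tag (attributes, if any)
          >
        )
    }{ exists $HTML::Tagset::isKnown{ lc $2 } ? '' : $1 }gex;
    return $text;
}

print strip_known_tags('<a href="mylink">text1</a>'), "\n";   # text1
print strip_known_tags('<this is a normal text>'), "\n";      # <this is a normal text>
```

The /e modifier turns the replacement into Perl code, so each candidate is looked up in the hash and either dropped or put back unchanged. A chat message such as '<Hehe>' survives because "hehe" is not a key in %HTML::Tagset::isKnown.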

Re: Strip HTML tags again
by tachyon (Chancellor) on Jun 30, 2002 at 20:35 UTC

      heh. i ended up writing HTML::TagFilter because tachyon shouted at me so loud. Which only does part of what you want, sadly, so I wouldn't recommend it. But i'm usefully reminded to finish the next version, which does the rest. And lots of other exciting things, i feel sure.

Re: Strip HTML tags again
by hacker (Priest) on Jul 01, 2002 at 10:33 UTC
    Have you looked at HTML::LinkExtor? It sounds like exactly what you want:
    HTML::LinkExtor is an HTML parser that extracts links from an HTML document. The HTML::LinkExtor is a subclass of HTML::Parser. This means that the document should be given to the parser by calling the $p->parse() or $p->parse_file() methods.

    I've used it successfully in the past with a lot of my parsing code.

Re: Strip HTML tags again
by ides (Deacon) on Jun 30, 2002 at 15:47 UTC
    This will probably do the trick; however, it does not handle HTML tags that span multiple lines. To do that you'll most likely have to join all the lines together into one scalar. It will also not catch multiple HTML tags on the same line, so you'll need to modify it to suit your needs.

    What this is doing is finding text contained in <>'s that has a corresponding ending tag.

    Here is the code ($l is the scalar holding the line of text):

    if( $l =~ /<.*?>(.*?)<\/.*?>/ ) { $l = $1; }

    -----------------------------------
    Frank Wiles <frank@wiles.org>
    http://frank.wiles.org

      Thanks, but I need a solution which 'knows' about possible HTML tags. What I need is to filter HTML from a chat message, and if someone types '<Hehe>', it will be wiped out.

      --dda

Re: Strip HTML tags again
by Mask (Pilgrim) on Jul 01, 2002 at 11:38 UTC
    Hi monks, I am a little bit disappointed in this whole discussion.
    If the input from chat is displayed in an HTML page, then any "<" or ">" in the displayed text will have been transformed to &lt; and &gt;. So if you can see <this is a normal text> in your web browser, then in the source of the HTML page it will be &lt;this is a normal text&gt;. In this case you need not worry about knowing all the tags, and if you want the text as it appears in the browser, you just need to replace "&lt;" with "<" and "&gt;" with ">" in your Perl code.
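The escaping described above could look roughly like the following, using only core Perl. The sub names escape_html and unescape_html are invented for this sketch, and handling "&" as well as "<" and ">" is an addition that the post does not mention; it is included so that the round trip stays lossless.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Character-to-entity map; "&" is an assumption added for a clean round trip.
my %escape   = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
my %unescape = reverse %escape;   # entity-to-character map

# Turn markup characters into entities so a browser shows them literally.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>])/$escape{$1}/g;
    return $text;
}

# Reverse the mapping in a single pass, as the post suggests for &lt; and &gt;.
sub unescape_html {
    my ($text) = @_;
    $text =~ s/(&(?:amp|lt|gt);)/$unescape{$1}/g;
    return $text;
}

print escape_html('<this is a normal text>'), "\n";   # &lt;this is a normal text&gt;
print unescape_html('&lt;b&gt;some text&lt;/b&gt;'), "\n";   # <b>some text</b>
```

Escaping shows every message exactly as typed, markup included, which is a different goal from stripping the tags.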
      Ok, I'll try to explain. If someone types '<b>some text</b>', it should not be displayed as bold text in the chat window, and displaying the HTML source is not a good idea either. I just have to strip all tags from the line.

      And '<some text>' should not be stripped. All regexp-based solutions will strip this too. I'm still looking for regexp-based stuff that will use %HTML::Tagset::isKnown hash to filter out only correct HTML tags.

      --dda

        look up the POD (or your preferred docs) for HTML::Tagset
        cite: "hashset %HTML::Tagset::isKnown
        This hashset lists all known HTML elements."
        So you've got to compare your match with that list ...

        Have a nice day
        All decision is left to your taste

        Addendum

        Look through the previous suggestions as well. Try it at least and ask again if you get an error or get otherwise stuck. :-)

Re: Strip HTML tags again
by mousey (Scribe) on Jul 01, 2002 at 06:20 UTC
    $foo =~ s/<(.|\n)+?>//g;


    This is great! From up here we can throw lots and lots of stuff! But uh... how do we get down? --Goblin Balloon Brigade
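For comparison, here is a small sketch of how the one-line substitution above behaves on the two kinds of input discussed in this thread. The wrapper sub strip_all_brackets is a made-up name for the example; the regex itself is the one from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the post above.
sub strip_all_brackets {
    my ($foo) = @_;
    $foo =~ s/<(.|\n)+?>//g;   # removes anything between "<" and the next ">"
    return $foo;
}

print strip_all_brackets('<b>some text</b>'), "\n";          # some text
print strip_all_brackets('I said <some text> here'), "\n";   # I said  here
```

The second call shows the limitation raised earlier in the thread: the pattern has no notion of which names are HTML elements, so '<some text>' is removed along with real tags.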

Node Type: perlquestion [id://178374]
Approved by cjf
Front-paged by cjf