PerlMonks  

Re: Re: Strip HTML tags again

by dda (Friar)
on Jul 01, 2002 at 12:49 UTC ( [id://178531] )


in reply to Re: Strip HTML tags again
in thread Strip HTML tags again

Ok, I'll try to explain. If someone types '<b>some text</b>', it should not be displayed as bold text in the chat window, and displaying the HTML source is not a good idea either. I just have to strip all tags from the line.

And '<some text>' should not be stripped. All the regexp-based solutions suggested so far will strip this too. I'm still looking for a regexp-based solution that uses the %HTML::Tagset::isKnown hash to filter out only valid HTML tags.

--dda

Replies are listed 'Best First'.
Re: Re: Re: Strip HTML tags again
by little (Curate) on Jul 01, 2002 at 12:55 UTC
    Look up the POD (or your preferred docs) for HTML::Tagset.
    To quote: "hashset %HTML::Tagset::isKnown
    This hashset lists all known HTML elements."
    So you've got to compare your match with that list ...

    Have a nice day
    All decision is left to your taste

    Addendum

    Look through the previous suggestions as well. Try it at least and ask again if you get an error or get otherwise stuck. :-)

      The problem is how to extract 'my match' from the regexp shown earlier (or from another one; please suggest one). I know about that hashset, and what I need is to apply it in my sub.

      --dda

        Hi! I think this does what you want:

        use HTML::Tagset;

        my %tags       = %HTML::Tagset::isKnown;
        my $tagpattern = "(" . join('|', keys %tags) . ")";
        print STDERR "$tagpattern\n";

        while (<>) {
            print strip_html_tags($_);
        }

        sub strip_html_tags {
            my $line = shift;
            # $1 is the tag name (from $tagpattern), $2 is the text between the tags
            $line =~ s/<\s*$tagpattern(?:\s*>|\s+[^>]*>)([^<]*)<\s*\/\1[^>]*>/$2/ig;
            return $line;
        }

        I first create the string $tagpattern by putting a "|" between all known HTML tags and surrounding the whole thing with parentheses. This gives something like "(a|p|code.....)", which is used later in the subroutine to check for valid HTML tags.

        The regex looks a bit complicated and I am sure that it can be written much better, but I believe it is sufficient for your cause.

        Note that this will only work for tags that are on one line, and it could get you into trouble if there are < or > signs inside a tag (I don't know whether this is possible in HTML).
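One way to approach the "extract my match" question, sketched here under some assumptions: capture the candidate tag name and look it up in %HTML::Tagset::isKnown inside the replacement, rather than interpolating every element name into the pattern. The sub name strip_known_tags and the small stand-in list (used only when HTML::Tagset is not installed) are hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Load HTML::Tagset if it is installed; the sketch still runs without it.
BEGIN { eval { require HTML::Tagset } }

# Prefer the real hashset. The short fallback list is an assumption made
# only so that this example is self-contained.
my %is_known = %HTML::Tagset::isKnown
    ? %HTML::Tagset::isKnown
    : map { $_ => 1 } qw(a b i p br em strong code);

# Delete anything shaped like an opening or closing tag, but only when
# its name is a known HTML element; otherwise put the text back untouched.
sub strip_known_tags {
    my ($line) = @_;
    $line =~ s{(<\s*/?\s*([A-Za-z][A-Za-z0-9]*)\b[^<>]*>)}
              { $is_known{ lc $2 } ? '' : $1 }ge;
    return $line;
}

print strip_known_tags("<b>some text</b> and <some text>\n");
```

Because each tag is judged on its own, nested tags and unpaired tags such as <br> are removed as well. A > inside a quoted attribute value would still confuse the pattern, so a real HTML parser remains the more robust choice.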

        update:

        It would probably be a lot wiser to use Ovid's code than my homegrown regex.

        ---- kurt
        Did you look further than ides' suggestion? Did you try Ovid's suggestion?
        Have a nice day
        All decision is left to your taste
