BBS HTML filter

tkroll has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(Ovid) Maybe you don't need to parse the HTML by Ovid (Cardinal) on Aug 07, 2000 at 01:53 UTC
I was playing around with this and I see that you may not need to parse the HTML. Basically, we're not stripping or evaluating HTML, we're trying to shut it down. jeffa had the right idea by substituting < with &lt;. I wrote a regex that might be a start for you:

```perl
$data =~ s/
    <                       # First '<'
    (?!                     # Not followed by (everything in this list is allowed)
        (?:                 # (with non-grouping parens)
            \/?br>          # A break tag
            |               # or
            \/?p>           # A paragraph tag
            |               # or
            \/?font[^>]*>   # A font tag
            |               # or
            \/?h[1-6]>      # A headline
        )                   # Close non-grouping parens
    )                       # End of negative lookahead
    (                       # Capture to $1
        [^>]+               # Everything until the final '>'
    )                       # End capture
    >                       # Final '>'
/&lt;$1&gt;/gsix;
```

This regex handles both opening and closing tags. It substitutes out matched pairs of angle brackets and ignores individual ones. I haven't tested it in depth, but I would probably want to play with this and see whether I could sneak something past it with mismatched angle brackets or server-side includes.

If you want to allow more HTML, just add the allowable elements to the negative lookahead list. This only allows very simple tags and has the benefit of you stating what you will allow, as opposed to stating what you won't allow (which has the risk of you overlooking something).

Also note that you want the entire document in the variable. If you run this line by line, someone could break the HTML up over several lines and beat the regex.

And for those who prefer it on one line: `$data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]+)>/&lt;$1&gt;/gsi;`

Cheers, Ovid

Ovid patiently waits to be blasted for this one.
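For anyone who wants to poke at the whitelist regex, here is a minimal sketch that wraps it in a sub and runs a few inputs through it. The sub name `escape_unlisted_tags` and the sample strings are made up for illustration, and the sketch assumes the character classes carry quantifiers (`[^>]*` and `[^>]+`) and that the replacement emits `&lt;` and `&gt;` entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): escapes the angle brackets
# of every tag that is not on the allowed list.
sub escape_unlisted_tags {
    my ($data) = @_;
    $data =~ s{
        <                   # opening angle bracket
        (?!                 # not followed by an allowed tag
            (?: /?br> | /?p> | /?font[^>]*> | /?h[1-6]> )
        )
        ( [^>]+ )           # tag body, captured as group 1
        >                   # closing angle bracket
    }{&lt;$1&gt;}gsix;
    return $data;
}

# Made-up sample inputs.
my @samples = (
    "<p>Hello<br>world</p>",                        # allowed tags, left alone
    "<script>alert(1)</script>",                    # not allowed, escaped
    "<H2>Title</H2>",                               # allowed, case-insensitive
    "1 < 2",                                        # lone bracket, left alone
    "<font size=2 onmouseover=alert(1)>hi</font>",  # allowed, attributes and all
);

print escape_unlisted_tags($_), "\n" for @samples;
```

The last sample passes through untouched because the `font` branch accepts any attributes, event handlers included, which is the kind of thing worth probing before relying on this.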
RE: (Ovid) Maybe you don't need to parse the HTML by Cirollo (Friar) on Aug 07, 2000 at 18:00 UTC
I really like the way Ovid broke up this regex with comments - I don't quite grok regexes yet, and things like this are very helpful to me. Maybe we'll see more of this in the future...hint hint...
Re: BBS HTML filter by davorg (Chancellor) on Aug 06, 2000 at 11:43 UTC
Regexen can only handle subsets of HTML. To do the job properly you'll need to use HTML::Parser or one of its subclasses. -- <http://www.dave.org.uk> European Perl Conference - Sept 22/24 2000, ICA, London <http://www.yapc.org/Europe/>
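To make the "subsets of HTML" point concrete, here is a small sketch of one case where a simple tag-matching regex misreads legal HTML: a `>` inside a quoted attribute value. The sub name `naive_strip` and the sample markup are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: removes anything that looks like <...>,
# assuming (wrongly) that a tag always ends at the first '>'.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# The '>' inside the quoted title attribute ends the match too early,
# so the rest of the tag leaks into the output as text.
my $html = q{<a title="1>0" href="u">link</a>};
print naive_strip($html), "\n";   # prints: 0" href="u">link
```

A real parser tracks quoting state, so it sees the whole `<a ...>` as one tag.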
(jeffa) Re: BBS HTML filter by jeffa (Bishop) on Aug 06, 2000 at 19:19 UTC
I know you said you wanted to keep certain HTML tags, but this solution will work in the meantime. I think the simplest solution is to 'literalize' HTML code, i.e. use substitution to turn angle brackets into their respective HTML entities: `$evil_html =~ s/</&lt;/g; $evil_html =~ s/>/&gt;/g;`

But this will, of course, hose all of your HTML code. One thing you could do is substitute the tags you want to keep into something that won't get hosed: `%keepers = ( '<p>' => '#p#', '<br>' => '#br#', '<hr>' => '#hr#', );`

Substitute these values in the code globally and case-insensitively, then perform the first substitution above, then substitute these values back to their original form. Works good, but, er, not so good for them font tags.

Your best bet is, like davorg said, HTML::Parser. The reason I am posting this cargo-cult method is that you can quickly use the first substitution to make sure that your users do not abuse your BBS while you are figuring out how to use HTML::Parser effectively.

hope this helps
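The three steps described above (hide the keepers, literalize, restore) can be sketched roughly like this. The sub name `literalize_except_keepers` and the sample input are made up for illustration; the `%keepers` table is the one from the post.

```perl
use strict;
use warnings;

# Tags to keep, mapped to placeholder tokens.
my %keepers = (
    '<p>'  => '#p#',
    '<br>' => '#br#',
    '<hr>' => '#hr#',
);

# Hypothetical helper name; not from any module.
sub literalize_except_keepers {
    my ($html) = @_;

    # Step 1: hide the tags we want to keep, globally and case-insensitively.
    for my $tag (keys %keepers) {
        $html =~ s/\Q$tag\E/$keepers{$tag}/gi;
    }

    # Step 2: turn every remaining angle bracket into an entity.
    $html =~ s/</&lt;/g;
    $html =~ s/>/&gt;/g;

    # Step 3: put the kept tags back (always in lowercase form).
    for my $tag (keys %keepers) {
        $html =~ s/\Q$keepers{$tag}\E/$tag/g;
    }

    return $html;
}

print literalize_except_keepers("<P>Hi<br><script>x</script>"), "\n";
# prints: <p>Hi<br>&lt;script&gt;x&lt;/script&gt;
```

One side effect to be aware of: a user who types the literal text `#p#` gets a `<p>` in the output. That is harmless here because only allowed tags can be produced this way, but it is worth knowing before adding more entries to the table.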
RE: BBS HTML filter by DrManhattan (Chaplain) on Aug 07, 2000 at 17:55 UTC
Here's an example using HTML::TokeParser:

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Regex representing the list of acceptable tags
my $ok_stuff = qr/^(p|br|h.|font|pre)$/;

# Some test html.
my $html = "<p><br><h3><a href='evil.js'>Testing</a></h3></p>\n";

# Instantiate the TokeParser
my $parser = HTML::TokeParser->new(\$html);

# Loop until all tokens are read
while (my $token = $parser->get_token()) {
    # Immediately print any "text" token
    if ($token->[0] eq "T") {
        print $token->[1];
    }
    # Check all other tokens against the regex before printing
    elsif ($token->[1] =~ $ok_stuff) {
        # The last element of a tag token is its original text
        print $token->[$#{$token}];
    }
}
```

The above code prints out `"<p><br><h3>Testing</h3></p>"`

-Matt
Re: BBS HTML filter by PotPieMan (Hermit) on Aug 07, 2000 at 06:47 UTC
Take a look at the Slashcode. They've got a nice tag-specific way of stripping HTML in Slash.pm. It uses an array of permitted tags, and strips the rest. -ppm
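As a rough illustration of the "array of permitted tags, strip the rest" idea, here is a sketch in plain Perl. This is not the code from Slash.pm; the names `@approved_tags` and `strip_unapproved`, the tag list, and the sample input are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical list of permitted tags.
my @approved_tags = qw(p br h1 h2 h3 b i);
my %is_approved   = map { lc($_) => 1 } @approved_tags;

# Hypothetical helper: rebuilds approved tags without their attributes
# and drops every other tag, leaving the text between tags alone.
sub strip_unapproved {
    my ($html) = @_;
    $html =~ s{<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>}
              { $is_approved{ lc($2) } ? "<$1" . lc($2) . ">" : "" }ge;
    return $html;
}

print strip_unapproved(q{<p><a href='evil.js'>Testing</a> <B onclick="x()">bold</B></p>}), "\n";
# prints: <p>Testing <b>bold</b></p>
```

Rebuilding the tag from its name alone, rather than echoing the original text, throws away any attributes. Being regex-based, the sketch has the usual limits: comments and other markup that does not start with a tag name are left as they are, and a `>` inside an attribute value will confuse it, so a real parser is the safer base.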


	PerlMonks