I was playing around with this, and it looks like you may not need to parse the HTML at all. We're not stripping or evaluating HTML; we're just trying to shut it down. jeffa had the right idea by substituting < with &lt;. I wrote a regex that might be a start for you:
$data =~ s/
< # First '<'
(?! # Not followed by (everything in this list is allowed)
(?: # (with non-grouping parens)
\/?br> # A break tag
| # or
\/?p> # A paragraph tag
| # or
\/?font[^>]*> # A font tag
| # or
\/?h[1-6]> # A headline
) # Close non-grouping parens
) # End of negative lookahead
( # Capture to $1
[^>]* # Everything until the final '>'
) # End capture
> # Final '>'
/&lt;$1&gt;/gsix;
This regex handles both opening and closing tags. It substitutes matched pairs of angle brackets and ignores unpaired ones. I haven't tested it in depth; before trusting it, I would want to throw mismatched angle brackets and server-side includes at it to see whether I could sneak something past it.
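Here is a rough sketch of the kind of poking around I mean. The escape_tags wrapper and the sample strings are made up for illustration, and I'm assuming the replacement is meant to emit the entities &lt; and &gt;:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above
# (assumes the replacement side emits &lt; and &gt;).
sub escape_tags {
    my ($data) = @_;
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

my @samples = (
    '<p>Hello<br>world</p>',       # allowed tags: left alone
    '<H2>Title</H2>',              # /i makes the match case-insensitive
    '<script>alert(1)</script>',   # not in the list: both tags escaped
    '1 < 2',                       # lone '<' with no '>': ignored
    '<script src=x',               # unclosed tag: also ignored
    '<font onmouseover="x">',      # font accepts any attributes
);

print escape_tags($_), "\n" for @samples;
```

Only the third sample changes (it comes out as &lt;script&gt;alert(1)&lt;/script&gt;). The last two pass through untouched, which is exactly the sort of thing I'd want to look at more closely before relying on this.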
If you want to allow more HTML, just add the allowable elements to the negative lookahead list. As written, this only allows very simple tags, and it has the benefit that you state what you will allow rather than what you won't (which risks overlooking something).
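If the list grows, one option is to build the lookahead from an array so that adding an element is a one-line change. This is only a sketch: @allowed, $re and escape_tags are names I made up, the b and i entries are just examples of additions, and I'm again assuming the replacement emits &lt; and &gt;:

```perl
use strict;
use warnings;

# Each entry is a regex fragment for what may follow '<' (or '</').
my @allowed = (
    'br',
    'p',
    'font[^>]*',
    'h[1-6]',
    'b',          # example addition
    'i',          # example addition
);

# Turn each fragment into  \/?FRAGMENT>  and join them with '|'.
my $alternatives = join '|', map { '\/?' . $_ . '>' } @allowed;
my $re = qr/<(?!(?:$alternatives))([^>]*)>/si;

sub escape_tags {
    my ($data) = @_;
    $data =~ s/$re/&lt;$1&gt;/g;
    return $data;
}

print escape_tags('<b>bold</b> and <script>bad</script>'), "\n";
# prints: <b>bold</b> and &lt;script&gt;bad&lt;/script&gt;
```

Because every fragment must be followed directly by '>', adding 'b' does not accidentally allow anything else that merely starts with a b.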
Also note that you want the entire document in the variable. If you run this line by line, someone could break the HTML up over several lines and beat the regex.
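To make that concrete, here is a small sketch comparing the two approaches on a tag split across a newline. The slurp and escape_tags helpers and the sample input are hypothetical, the in-memory filehandle stands in for whatever file or socket you actually read from, and the replacement is assumed to emit &lt; and &gt;:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution
# (assumes the replacement side emits &lt; and &gt;).
sub escape_tags {
    my ($data) = @_;
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

# Read everything from a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;               # no record separator: <$fh> returns the whole input
    return scalar <$fh>;
}

my $html = "<script\nsrc='evil.js'>";

# Line by line: neither line holds a complete <...> pair, so nothing matches.
my $by_line = join '', map { escape_tags($_) } split /^/, $html;

# Whole document: [^>]* happily crosses the newline, so the tag is caught.
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
my $whole = escape_tags(slurp($fh));
close $fh;

print "line by line: $by_line\n";
print "whole:        $whole\n";
```

The line-by-line result is identical to the input, while the slurped version comes back with the brackets escaped.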
And for those who prefer it on one line:
$data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
Cheers,
Ovid
Ovid patiently waits to be blasted for this one.