PerlMonks  

BBS HTML fitler

by tkroll (Initiate)
on Aug 06, 2000 at 05:30 UTC ( [id://26381]=perlquestion )

tkroll has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a BBS, and I need to filter out evil HTML from messages. I want to allow only, maybe, <h>, <br>, <font...>, stuff like that. Has anyone ever had to deal with this? I was just going to regex it, but I have an itching feeling that there is more to this problem. -Ty

Replies are listed 'Best First'.
(Ovid) Maybe you don't need to parse the HTML
by Ovid (Cardinal) on Aug 07, 2000 at 01:53 UTC
    I was playing around with this and I see that you may not need to parse the HTML. Basically, we're not stripping or evaluating HTML, we're trying to shut it down. jeffa had the right idea by substituting < with &lt;. I wrote a regex that might be a start for you:
    $data =~ s/ <                    # First '<'
                (?!                  # Not followed by (everything in this list is allowed)
                  (?:                # (with non-grouping parens)
                      \/?br>         # A break tag
                    |                # or
                      \/?p>          # A paragraph tag
                    |                # or
                      \/?font[^>]*>  # A font tag
                    |                # or
                      \/?h[1-6]>     # A headline
                  )                  # Close non-grouping parens
                )                    # End of negative lookahead
                (                    # Capture to $1
                  [^>]*              # Everything until the final '>'
                )                    # End capture
                >                    # Final '>'
              /&lt;$1&gt;/gsix;
    This regex handles both opening and closing tags. It escapes matched pairs of angle brackets and ignores unpaired ones. I haven't tested it in depth; I would want to play with it and see whether I could sneak something past it with mismatched angle brackets or server-side includes.

    If you want to allow more HTML, just add the allowable elements in the negative lookahead list. This only allows very simple tags and has the benefit of you stating what you will allow, as opposed to stating what you won't allow (which has the risk of you overlooking something).

    Also note that you want the entire document in the variable. If you run this line by line, someone could break the HTML up over several lines and beat the regex.

    And for those who prefer it on one line:

    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    Cheers,
    Ovid

    Ovid patiently waits to be blasted for this one.
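
    A minimal sketch of Ovid's point about keeping the whole document in one variable. The sub name escape_disallowed_tags and the sample post are made up for illustration; the substitution is Ovid's regex in its one-line form.

```perl
use strict;
use warnings;

# Hypothetical wrapper around Ovid's whitelist regex.
sub escape_disallowed_tags {
    my ($data) = @_;
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

# Sample post (an assumption): the script tag is split across two lines.
my $post = "<p>Hello</p><script\nsrc='evil.js'></script>";

# Whole document in one variable, as Ovid recommends.
my $whole = escape_disallowed_tags($post);

# Line by line: each line is filtered on its own, then rejoined.
my $by_line = join "\n", map { escape_disallowed_tags($_) } split /\n/, $post;

print "whole:   $whole\n";
print "by line: $by_line\n";
```

    Run over the whole string, the split <script ...> tag is escaped. Run line by line, the opening tag survives, because neither line on its own contains a complete <...> pair. Separately, the \/?font[^>]*> branch lets any attributes through on a font tag, so that branch is probably worth tightening.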

      I really like the way Ovid broke up this regex with comments - I don't quite grok regexes yet, and things like this are very helpful to me. Maybe we'll see more of this in the future...hint hint...
Re: BBS HTML fitler
by davorg (Chancellor) on Aug 06, 2000 at 11:43 UTC
(jeffa) Re: BBS HTML fitler
by jeffa (Bishop) on Aug 06, 2000 at 19:19 UTC
    I know you said you wanted to keep certain HTML tags, but this solution will work in the meantime.

    I think the simplest solution is to 'literalize' HTML code, i.e. use substitution to turn angle brackets into their respective HTML entities:

    $evil_html =~ s/</&lt;/g;
    $evil_html =~ s/>/&gt;/g;
    But this will, of course, hose all of your HTML code. One thing you could do is substitute the tags you want to keep into something that won't get hosed:
    %keepers = ( '<p>' => '#p#', '<br>' => '#br#', '<hr>' => '#hr#', );
    Substitute these values into the code globally and case-insensitively, then perform the first substitution above, then substitute the placeholders back to their original form.

    Works good, but, er, not so good for them font tags. Your best bet is, like davorg said, HTML::Parser. The reason I am posting this cargo-cult method is that you can quickly use the first substitution to make sure that your users do not abuse your BBS while you are figuring out how to use HTML::Parser effectively.

    hope this helps
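
    The three steps jeffa describes (protect the keepers, literalize everything else, restore the keepers) could be sketched roughly as follows. The sub name literalize_except_keepers and the sample input are assumptions for illustration; the %keepers table is the one from the post.

```perl
use strict;
use warnings;

# Tags to keep, mapped to placeholders that contain no angle brackets.
my %keepers = (
    '<p>'  => '#p#',
    '<br>' => '#br#',
    '<hr>' => '#hr#',
);

# Hypothetical helper implementing the protect / escape / restore steps.
sub literalize_except_keepers {
    my ($html) = @_;

    # Step 1: swap each keeper tag for its placeholder, case-insensitively.
    for my $tag (keys %keepers) {
        $html =~ s/\Q$tag\E/$keepers{$tag}/gi;
    }

    # Step 2: literalize every remaining angle bracket.
    $html =~ s/</&lt;/g;
    $html =~ s/>/&gt;/g;

    # Step 3: turn the placeholders back into the original tags.
    for my $tag (keys %keepers) {
        $html =~ s/\Q$keepers{$tag}\E/$tag/g;
    }

    return $html;
}

print literalize_except_keepers("<P>Hi<br><script>alert(1)</script>"), "\n";
```

    Two things to be aware of in this sketch: kept tags come back in the lowercase form stored in %keepers, and closing tags such as </p> are not in the table, so they get literalized along with everything else. A user who types #p# literally will also get a <p>, which is harmless here only because every placeholder maps to an allowed tag.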

RE: BBS HTML fitler
by DrManhattan (Chaplain) on Aug 07, 2000 at 17:55 UTC

    Here's an example using HTML::TokeParser

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    # Regex representing the list of acceptable tags
    my $ok_stuff = qr/^(p|br|h.|font|pre)$/;

    # Some test html.
    my $html = "<p><br><h3><a href='evil.js'>Testing</a></h3></p>\n";

    # Instantiate the TokeParser
    my $parser = HTML::TokeParser->new(\$html);

    # Loop until all tokens are read
    while (my $token = $parser->get_token()) {
        # Immediately print any "text" token
        if ($token->[0] eq "T") {
            print $token->[1];
        }
        # Check all other tokens against the regex before printing
        elsif ($token->[1] =~ $ok_stuff) {
            print $token->[$#{$token}];
        }
    }

    The above code prints out "<p><br><h3>Testing</h3></p>"

    -Matt

Re: BBS HTML fitler
by PotPieMan (Hermit) on Aug 07, 2000 at 06:47 UTC

    Take a look at the Slashcode. They've got a nice tag-specific way of stripping HTML in Slash.pm.

    It uses an array of permitted tags, and strips the rest.

    -ppm
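
    For illustration only, here is a rough sketch of the general idea of an array of permitted tags with everything else stripped. This is not the code from Slash.pm; the @permitted list, the sub name strip_unpermitted_tags, and the choice to drop all attributes from kept tags are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical whitelist; not taken from Slash.pm.
my @permitted = qw(p br b i);
my %ok = map { lc($_) => 1 } @permitted;

sub strip_unpermitted_tags {
    my ($html) = @_;

    # Visit every '<'. If it starts something tag-shaped, keep a bare
    # version of the tag when its name is permitted and strip it otherwise.
    # A '<' that does not start a complete tag is escaped to &lt;.
    $html =~ s{<(?:\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^<>]*>)?}{
        defined $2
            ? ( $ok{ lc $2 } ? "<$1" . lc($2) . ">" : '' )
            : '&lt;'
    }ge;

    return $html;
}

print strip_unpermitted_tags(
    "<p>Hi <b onclick='x()'>you</b><script>alert(1)</script> 1 < 2</p>"
), "\n";
```

    Rebuilding each kept tag from just its name is a deliberate choice in this sketch: it throws away attributes such as onclick, at the cost of also throwing away harmless ones. Supporting something like <font color=...> would need a per-tag list of allowed attributes, which is where a real parser such as HTML::Parser starts to look more attractive than a regex.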
