Regex: Strip <script> tags?

Spidy has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex: Strip <script> tags? by skx (Parson) on Mar 21, 2007 at 15:29 UTC
There are a lot more things to worry about than just raw <script> tags. For example: `<a href="http://example.com" onClick="alert(1);">test</a>` To deal with this complexity properly, you should use one of the filtering modules available from CPAN. I've had good experience with HTML::Scrubber, but there are a few more, including HTML::EscapeEvil and HTML::Sanitizer. Steve --
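A minimal sketch of the point above, assuming a naive tag-stripping substitution (the pattern and the sample input are illustrative, not taken from the original question): the script element is removed, but the inline event handler passes through untouched.

```perl
use strict;
use warnings;

# Hypothetical input: one <script> element plus a link with an inline handler.
my $html = '<script>alert(0);</script>'
         . '<a href="http://example.com" onClick="alert(1);">test</a>';

# Naive approach: delete everything between <script ...> and </script>.
(my $stripped = $html) =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;

# The script element is gone, but the onClick attribute survives.
my $handler_survives = $stripped =~ /onClick\s*=/i ? 1 : 0;

print "$stripped\n";
print "event handler still present: $handler_survives\n";
```

This is why a whitelist-based filter, which only lets known-safe tags and attributes through, tends to be a better fit than a blacklist of tag names.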
Re: Regex: Strip <script> tags? by ww (Archbishop) on Mar 21, 2007 at 16:19 UTC
If your only worry were attributes following the start of the tag (as, for example, `<script src=....`), you could simply remove the "`>`" at the end of the first "`<script>`" in the (cargo-culted) regex, thusly: `<script .*?<\/script>...`, which will catch anything inside script tags (unless, illogically, they're miswritten by your users with nested <script ...> tags). (Update: In fact, this is a FAQ.) However, as skx has already pointed out, evil is not restricted to items labeled "<script...>". Bottom line: You should probably study security issues (suggestion: start with some examples of why to use `-t` and move on to more generic considerations) AND should improve your regex-fu before borrowing code. You've been here long enough to have seen discussions of the un-wisdom of writing your own .html parsers, and might wish to review some of those (Cliff notes-style summary: you might screw up by rolling your own) and also read these old-but-still-good nodes: Re: How to remove HTML tags from text (by skx, with a more expansive version of his comment above); How do I test for potential security problems?; and Re: Remove HTML tags from document, including Jured's links to asking questions.
Re: Regex: Strip <script> tags? by rodion (Chaplain) on Mar 21, 2007 at 16:06 UTC
skx has better advice, but as for the question as you posed it: `s/<script[^>]*>.*?<\/script>//igs;` should work. The `[^>]*` accepts any characters that are not ">", up to the ">" that terminates the tag. It may not be the best solution to this particular problem, but it's a very handy regex idiom to have ready access to.
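A small sketch of the negated-character-class idiom described above, with the quantifiers written out. The sample markup is an assumption made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a script tag with attributes, followed by ordinary markup.
my $html = '<script type="text/javascript" src="x.js">var a = 1;</script><p>kept</p>';

# [^>]* consumes the attribute text but cannot run past the tag's closing ">".
my ($attrs) = $html =~ /<script([^>]*)>/i;

# The same idiom inside the full substitution.
(my $clean = $html) =~ s/<script[^>]*>.*?<\/script>//igs;

print "attributes: $attrs\n";
print "cleaned:    $clean\n";
```

The idiom is handy whenever a match must stop at the first occurrence of a delimiter; it is not, by itself, a defence against hostile markup.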
Re^2: Regex: Strip <script> tags? by ikegami (Patriarch) on Sep 02, 2007 at 20:01 UTC
It lets the following through: `<<script></script>script>...</script>` It's also a poor regexp in a more general sense, since it doesn't check whether the `>` actually closes the tag or is inside the quotes of an attribute value.
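Both objections can be reproduced with a short sketch. The first input is the payload quoted above; the second is a hypothetical input invented here to show the quoting problem. The pattern is the single-pass substitution under discussion, with its quantifiers written out.

```perl
use strict;
use warnings;

my $regex = qr/<script[^>]*>.*?<\/script>/is;

# Case 1: removing the inner pair stitches a fresh <script> tag together.
(my $bypass = '<<script></script>script>...</script>') =~ s/$regex//g;

# Case 2 (hypothetical input): no script element at all, only attribute
# values that happen to contain the tag text. The regex cannot tell the
# difference and eats the markup in between.
(my $mangled = '<p title="<script>">one</p><p title="</script>">two</p>') =~ s/$regex//g;

print "bypass:  $bypass\n";
print "mangled: $mangled\n";
```

In the first case the output is itself a script element; in the second, harmless content is lost. Telling the two apart requires tracking quotes and tag boundaries, which is what an HTML parser does.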
Re: Regex: Strip <script> tags? by duelafn (Parson) on Mar 21, 2007 at 17:33 UTC
Yes, do use a prepackaged filter. `<scr<script>Kiddies</script>ipt> are clever buggers</script>` Update: In response to the Anonymous Monk below (in case you think you can win the battle of workarounds): check out the XSS Cheat Sheet. It is quite old, so don't count on it including all XSS exploits. However, look at that list and ask yourself whether your time is better spent researching and fighting these or actually working on something related to your site's business. --- My advice: Find and use a module which scrubs user-submitted HTML. Find one which is maintained and thorough. It isn't typically worth doing it yourself. (In general.) No, your case is probably not special enough to warrant doing it yourself; you've got better things to do. Good Day, Dean
Re^2: Regex: Strip <script> tags? by Anonymous Monk on May 25, 2012 at 22:35 UTC
This will fix what you were talking about. You can loop through it as many times as needed to remove unwanted script tags and everything within them: `$bool = true; while ($bool) { $str = preg_replace('/<script\ .*?<\/.*?script>/i','', $str); if (!(preg_match('/<script\ .*?<\/.*?script>/i', $str))){ $bool = false; } }`
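For comparison, a Perl sketch of the same repeat-until-stable idea. The helper name `strip_scripts_repeatedly` and the pattern are assumptions made for illustration; they are not a translation of the PHP pattern above.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat the substitution until a pass removes nothing.
sub strip_scripts_repeatedly {
    my ($str) = @_;
    my $passes = 0;
    # s///g returns the number of substitutions made, so the loop ends
    # on the first pass that finds nothing to remove.
    $passes++ while $str =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;
    return ($str, $passes);
}

# duelafn's split-tag payload needs two passes before nothing is left.
my ($clean, $passes) = strip_scripts_repeatedly(
    '<scr<script>Kiddies</script>ipt> are clever buggers</script>'
);

# An inline event handler is not a script tag, so looping does not help.
my ($leftover) = strip_scripts_repeatedly('<a href="#" onclick="alert(1)">x</a>');

print "passes: $passes, result: '$clean'\n";
print "still dangerous: $leftover\n";
```

Looping closes the split-tag hole, but only that hole: event-handler attributes and the other vectors mentioned in this thread are untouched.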
Re: Regex: Strip <script> tags? by stonecolddevin (Parson) on Mar 22, 2007 at 00:31 UTC
I personally enjoy HTML::Scrubber. It allows you to create a pretty detailed profile of what HTML you want allowed/disallowed. From the docs: (Turns out JavaScript is turned off by default. See the script method for more info.)

    #!/usr/bin/perl -w
    use HTML::Scrubber;
    use strict;

    my $html = q[
    <style type="text/css"> BAD { background: #666; color: #666;} </style>
    <script language="javascript"> alert("Hello, I am EVIL!"); </script>
    <HR>
        a   => <a href=1>link </a>
        br  => <br>
        b   => <B> bold </B>
        u   => <U> UNDERLINE </U>
    ];

    my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );

    print $scrubber->scrub($html);

    $scrubber->deny( qw[ p b i u hr br ] );

    print $scrubber->scrub($html);

Hope this helps! meh.
Re: Regex: Strip <script> tags? by sanPerl (Friar) on Mar 22, 2007 at 06:26 UTC
It is better to use some ready-to-eat kind of CPAN module. However, if you want a simple solution, here it is. It handles the following kinds of tags:
1) <script> (should be deleted)
2) <script a="aaa"> (should be deleted)
3) <script1> (should be retained by the regex)
4) ANY OTHER TAG other than those mentioned above (should be retained by the regex)

    ## This step is required so that tags like <script1>, <scriptabc>, <scriptxyz>, etc. escape deletion
    s/<script>/<script >/igs;
    s/<script\ .*?>.*?<\/script>//igs;
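The four cases can be checked with a short sketch. Note that this uses a single substitution with an optional attribute group, which is an alternative assumed here for illustration rather than the two-step approach in the post; the sample inputs are made up.

```perl
use strict;
use warnings;

# Hypothetical inputs, one per case listed in the post.
my @cases = (
    '<script>a</script>',          # 1) plain tag: delete
    '<script a="aaa">b</script>',  # 2) tag with attribute: delete
    '<script1>c</script1>',        # 3) different tag name: retain
    '<b>d</b>',                    # 4) any other tag: retain
);

# After "<script" allow either ">" directly or whitespace plus attributes,
# so that <script1> and <scriptabc> do not match.
my @results = map {
    (my $s = $_) =~ s{<script(?:\s[^>]*)?>.*?</script>}{}gis;
    $s;
} @cases;

print "'$_'\n" for @results;
```

As with the other regex-only approaches in this thread, this only distinguishes tag names; it does nothing about split tags or event-handler attributes.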
Re: Regex: Strip <script> tags? by hacker (Priest) on Sep 02, 2007 at 14:37 UTC
I use the following in a piece of code here: `# Strip <script [..]>..</script> and <style>..</style>` `$content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;` backtracking++
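A quick sketch of how the backreference in that substitution behaves, using sample strings invented for illustration: `\1` forces the closing tag to carry the same name that the opening tag captured.

```perl
use strict;
use warnings;

# Hypothetical content containing both a style and a script element.
my $content = '<style>p{color:red}</style><p>text</p><SCRIPT src="a.js"></SCRIPT>';
$content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

# A mismatched pair is left alone, because \1 must repeat the captured name.
my $mismatch = '<script>x</style>';
$mismatch =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

print "content:  $content\n";
print "mismatch: $mismatch\n";
```

The second case shows the limit of the approach: markup that a browser may still treat as an open script element is passed through unchanged.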
Re^2: Regex: Strip <script> tags? by clinton (Priest) on Sep 02, 2007 at 14:49 UTC
There are plenty of things that will be missed with your regex. For instance, all of the `onclick/focus/load/etc` events. Have a look at HTML::StripScripts::Parser, which allows you to customise the HTML / CSS that you would like to allow, while removing XSS attacks. Clint
Re: Regex: Strip <script> tags? by Anonymous Monk on May 25, 2012 at 22:28 UTC
This will fix what duelafn was talking about. You can loop through it as many times as needed to remove unwanted script tags and everything within them: `$bool = true; while ($bool) { $str = preg_replace('/<script\ .*?<\/.*?script>/i','', $str); if (!(preg_match('/<script\ .*?<\/.*?script>/i', $str))){ $bool = false; } }`

