Re: Regex: Strip <script> tags?
by skx (Parson) on Mar 21, 2007 at 15:29 UTC
|
<a href="http://example.com" onClick="alert(1);">test</a>
To deal with this complexity properly you should be looking at using one of the filtering modules available from CPAN.
I've had good experiences with HTML::Scrubber, but there are a few others, including HTML::EscapeEvil and HTML::Sanitizer.
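To make the point about the onClick example concrete, here is a minimal sketch using only core Perl. The helper name strip_script_tags is made up for illustration, and the pattern is the script-only regex suggested elsewhere in this thread, not anything a CPAN filter does internally. It shows that removing <script> blocks leaves an event-handler attribute untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: removes only <script>...</script> blocks.
sub strip_script_tags {
    my ($html) = @_;
    $html =~ s/<script[^>]*>.*?<\/script>//gis;
    return $html;
}

my $input  = '<a href="http://example.com" onClick="alert(1);">test</a>';
my $output = strip_script_tags($input);

# The onClick handler never involved a <script> tag, so nothing is removed.
print $output eq $input ? "unchanged: $output\n" : "changed: $output\n";
```

The link comes back exactly as it went in, handler included, which is why a whitelist-based filter is the safer route.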
Re: Regex: Strip <script> tags?
by ww (Archbishop) on Mar 21, 2007 at 16:19 UTC
|
If your only worry were attributes following the start of the tag ... as, for example,
<script src=....
you could simply remove the ">" at the end of the first "<script>" in the (cargo-culted) regex, thusly
<script .*?<\/script>...
which will catch anything inside script tags (unless, illogically, they're miswritten by your users with nested <script ...> tags).
(Update: In fact, this is a faq.)
However, as skx has already pointed out, evil is not restricted to items labeled "<script...>".
Bottom line: You should probably consider/study security issues (suggestion: start with some examples of why to use -t and move on to more generic considerations) AND should improve your regex-fu before borrowing code.
You've been here long enough to have seen discussions of the un-wisdom of writing your own HTML parsers, and might wish to review some of those (Cliff's Notes-style summary: you might screw up by rolling your own). Also read these old-but-still-good nodes: Re: How to remove HTML tags from text (by skx, with a more expansive version of his comment above); How do I test for potential security problems?; and Re: Remove HTML tags from document, including Jured's links to asking questions.
Re: Regex: Strip <script> tags?
by rodion (Chaplain) on Mar 21, 2007 at 16:06 UTC
|
skx has better advice, but as for the question as you posed it:
s/<script[^>]*>.*?<\/script>//igs;
should work. The [^>]* part accepts any characters that are not ">", up to the ">" that terminates the tag. It may not be the best solution to this particular problem, but it's a very handy regex idiom to have ready access to.
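For anyone who wants to try the substitution above, here is a small sketch that wraps it in a sub and runs it on a sample string. The sub name and the sample page are invented for the demonstration; only the regex itself comes from the post.

```perl
use strict;
use warnings;

# The substitution from the post, wrapped in a sub (the name is made up).
sub remove_script_blocks {
    my ($html) = @_;
    # [^>]* : any attributes, up to the first ">"
    # .*?   : shortest possible body; /s lets "." cross newlines
    # /i    : also matches <SCRIPT>; /g removes every block
    $html =~ s/<script[^>]*>.*?<\/script>//igs;
    return $html;
}

# Sample input (hypothetical): one block with attributes and newlines, one in upper case.
my $page = qq{<p>Hi</p><script src="evil.js">\nalert(1);\n</script><p>Bye</p><SCRIPT>x()</SCRIPT>};
print remove_script_blocks($page), "\n";   # prints <p>Hi</p><p>Bye</p>
```

This only shows what the idiom does on well-formed input; it says nothing about hostile input.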
|
<<script></script>script>...</script>
It's also a poor regexp in a more general sense, since it doesn't check whether the > actually closes the tag or is inside the quotes of an attribute value.
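Both objections can be checked with a few lines of core Perl. This is only a sketch: the sub names and the second sample tag are invented, and the pattern is the s/<script[^>]*>.*?<\/script>//igs substitution under discussion.

```perl
use strict;
use warnings;

# One pass of the substitution being criticised.
sub strip_once {
    my ($html) = @_;
    $html =~ s/<script[^>]*>.*?<\/script>//igs;
    return $html;
}

# Returns what the [^>]* idiom believes the opening tag to be.
sub first_script_tag {
    my ($html) = @_;
    my ($tag) = $html =~ /(<script[^>]*>)/i;
    return $tag;
}

# Case 1: removing the inner pair splices the leftovers into a new tag.
print strip_once('<<script></script>script>alert(1)</script>'), "\n";
# prints <script>alert(1)</script>

# Case 2: a ">" inside a quoted attribute value is mistaken for the tag end.
print first_script_tag('<script title="a>b" src="evil.js">'), "\n";
# prints <script title="a>
```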
Re: Regex: Strip <script> tags?
by duelafn (Parson) on Mar 21, 2007 at 17:33 UTC
|
Yes, do use a prepackaged filter. <scr<script>Kiddies</script>ipt> are clever buggers</script>
Update: In response to the anonymous monk below (in case you think you can win the battle of workarounds): check out the XSS Cheat Sheet. It is quite old, so don't count on it including all XSS exploits. Still, look at that list and ask yourself whether your time is better spent researching and fighting these or actually working on something related to your site's business. --- My advice: find and use a module which scrubs user-submitted HTML. Find one which is maintained and thorough. It isn't typically worth doing it yourself (in general). No, your case is probably not special enough to warrant doing it yourself - you've got better things to do.
Re: Regex: Strip <script> tags?
by stonecolddevin (Parson) on Mar 22, 2007 at 00:31 UTC
|
I personally enjoy HTML::Scrubber.
It allows you to create a pretty detailed profile of what HTML you want allowed/disallowed.
(Turns out JavaScript is turned off by default. See the script method for more info.)
From the docs:
#!/usr/bin/perl -w
use HTML::Scrubber;
use strict;

my $html = q[
<style type="text/css"> BAD { background: #666; color: #666;} </style>
<script language="javascript"> alert("Hello, I am EVIL!"); </script>
<HR>
a => <a href=1>link </a>
br => <br>
b => <B> bold </B>
u => <U> UNDERLINE </U>
];

my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );

print $scrubber->scrub($html);

$scrubber->deny( qw[ p b i u hr br ] );

print $scrubber->scrub($html);
Hope this helps!
Re: Regex: Strip <script> tags?
by sanPerl (Friar) on Mar 22, 2007 at 06:26 UTC
|
It is better to use a ready-to-eat CPAN module. However, if you want a simple solution, here it is.
It handles the following kinds of tags:
1) <script> (should be deleted)
2) <script a="aaa"> (should be deleted)
3) <script1> (should be retained by the regex)
4) ANY OTHER TAG not mentioned above (should be retained by the regex)
## This is required so that we can escape processing of tags like <script1>, <scriptabc>,<scriptxyz>....etc from deletion
s/<script>/<script >/igs;
s/<script\ .*?>.*?<\/script>//igs;
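The four cases listed in the post can be checked with a short script. This is a sketch only: the sub name and the sample strings are made up, while the two substitutions are the ones given above.

```perl
use strict;
use warnings;

# The two substitutions from the post, applied in order (sub name is hypothetical).
sub strip_script {
    my ($html) = @_;
    # Step 1: give a bare <script> a trailing space so step 2 can require one.
    $html =~ s/<script>/<script >/igs;
    # Step 2: the mandatory space keeps <script1>, <scriptabc>, etc. out of the match.
    $html =~ s/<script\ .*?>.*?<\/script>//igs;
    return $html;
}

for my $case (
    '<script>a()</script>',            # 1) deleted
    '<script a="aaa">b()</script>',    # 2) deleted
    '<script1>c()</script1>',          # 3) retained
    '<b>bold</b>',                     # 4) retained
) {
    printf "%-30s => '%s'\n", $case, strip_script($case);
}
```

As with the other regexes in this thread, this covers tidy input only; it does nothing about nested fragments or event-handler attributes.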
Re: Regex: Strip <script> tags?
by hacker (Priest) on Sep 02, 2007 at 14:37 UTC
|
# Strip <script [..]>..</script> and <style>..</style>
$content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;
backtracking++
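A quick way to see what the \1 backreference buys you: the closing tag must carry the same name as the opening one. The sketch below wraps the substitution in a sub (the name and the sample strings are invented) and runs it on mixed script and style content.

```perl
use strict;
use warnings;

# The substitution from the post, wrapped in a sub (the name is made up).
sub strip_script_and_style {
    my ($content) = @_;
    # (s(?:cript|tyle)) captures "script" or "style"; \1 demands the same name
    # in the closing tag, so <style>...</script> is not treated as a pair.
    $content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;
    return $content;
}

print strip_script_and_style(
    '<style type="text/css">p{}</style><p>ok</p><script>x()</script>'
), "\n";   # prints <p>ok</p>

# A mismatched closer is skipped; the match runs on to the real </style>.
print strip_script_and_style('<style>a</script>b</style>c'), "\n";   # prints c
```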
Re: Regex: Strip <script> tags?
by Anonymous Monk on May 25, 2012 at 22:28 UTC
|
This will fix what duelafn was talking about. You can loop as many times as needed to remove unwanted script tags and everything within them.
// PHP, not Perl: repeat the replacement until a pass removes nothing.
// \b replaces the original "\ " so that a bare <script> also matches;
// the s flag lets "." span newlines.
do {
    $str = preg_replace('/<script\b.*?<\/.*?script>/is', '', $str, -1, $count);
} while ($count > 0);
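Since the rest of the thread is Perl, here is a rough Perl rendering of the same repeat-until-nothing-changes idea. It is a sketch under assumptions: the sub name is invented, and the pattern is rodion's from earlier in the thread rather than the one in this post. The test string is duelafn's example.

```perl
use strict;
use warnings;

# Hypothetical helper: keep substituting until a pass finds nothing to remove.
sub strip_until_stable {
    my ($html) = @_;
    # s///g returns the number of substitutions made (false when there were
    # none), so the loop ends after the first pass that changes nothing.
    1 while $html =~ s/<script[^>]*>.*?<\/script>//gis;
    return $html;
}

my $sneaky = '<scr<script>Kiddies</script>ipt> are clever buggers</script>';

# Pass 1 removes the inner pair and leaves "<script> are clever buggers</script>";
# pass 2 removes that; pass 3 finds nothing and the loop stops.
print strip_until_stable($sneaky) eq '' ? "all gone\n" : "leftover\n";
```

Looping closes this one hole only. It does not touch event-handler attributes such as the onClick example at the top of the thread, so the advice to use a maintained scrubbing module still stands.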