HTML::Scrubber - Perl extension for scrubbing/sanitizing html

Category:	HTML Utility
Author/Contact Info	/msg podmaster
Description:	`use HTML::Scrubber; my $html = q[ <B> bold </B> <i> italic </i> <u> underlined </u> ]; my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] ); print $scrubber->scrub($html); $scrubber->deny( qw[ p b i u hr br ] ); print $scrubber->scrub($html); __END__` [download] bold italic underlined bold italic underlined Grab a vanilla tarball here. update: I've done the TODO, and uploaded HTML-Scrubber-0.02.tar.gz to CPAN. Version 0.01 kept here for historical purposes.
=head1 NAME HTML::Scrubber - Perl extension for scrubbing/sanitizing html =head1 SYNOPSIS =for example begin #!/usr/bin/perl -w use HTML::Scrubber; use strict; + # my $html = q[ <HR> + # <B> bold + # <U> underlined + # <I> + # <A href=#> LINK </A> + # </I> + # </U> + # </B> + # </HR> + # ]; + # my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] + ] ); + # print $scrubber->scrub($html); + # $scrubber->deny( qw[ p b i u hr br ] ); + # print $scrubber->scrub($html); + # =for example end =head1 DESCRIPTION If you wanna "scrubbing"/"sanitize" html input, this be the modulehtml I wasn't satisfied with HTML::Sanitizer because it is based on HTML::TreeBuilder, so I thought I'd write something similar that works directly with HTML::Parser. =head1 How does it work? First a note on documentation: just study the example below. It's all the documentation you could need. If you're new to perl, good luck to you. When a tag is encountered, HTML::Scrubber allows/denies the tag using the explicit rule if one exists. If no explicit rule exists, Scrubber applies the default rule. If an explicit rule exists, but it's a simple rule(0 or 1), the default attribute rule is applied. =head2 EXAMPLE =for example begin #!/usr/bin/perl -w use HTML::Scrubber; use strict; + # my @allow = qw[ br hr b a ]; + # my @rules = ( script => 0, img => { src => qr{^(?!http://)}i, # only relative image links allo +wed alt => 1, '' => 0, }, ); + # my @default = ( 0 => # default rule, deny all tags { '' => 1, # default rule, allow all attributes 'href' => qr{^(?!(?:java)?script)}i, 'src' => qr{^(?!(?:java)?script)}i, # If your perl doesn't have qr # just use a string with length greater than 1 'cite' => '(?i-xsm:^(?!(?:java)?script))', 'language' => 0, 'name' => 1, # could be sneaky, but hey ;) 'onblur' => 0, 'onchange' => 0, 'onclick' => 0, 'ondblclick' => 0, 'onerror' => 0, 'onfocus' => 0, 'onkeydown' => 0, 'onkeypress' => 0, 'onkeyup' => 0, 'onload' => 0, 'onmousedown' => 0, 'onmousemove' => 0, 'onmouseout' => 0, 'onmouseover' => 0, 'onmouseup' => 0, 'onreset' => 0, 'onselect' => 0, 'onsubmit' => 0, 'onunload' => 0, 'src' => 0, 'type' => 0, } ); + # my $scrubber = HTML::Scrubber->new(); $scrubber->allow( @allow ); $scrubber->rules( @rules ); # key/value pairs $scrubber->default( @default ); $scrubber->comment(1); # 1 allow, 0 deny + # ## preferred way to create the same object $scrubber = HTML::Scrubber->new( allow => \@allow, rules => \@rules, default => \@default, comment => 1, process => 0, ); + # require Data::Dumper,die Data::Dumper::Dumper($scrubber) if @ARGV; + # my $it = q[ <?php echo(" EVIL EVIL EVIL "); ?> <!-- asdf --> <hr> <I FAKE="attribute" > IN ITALICS WITH FAKE="attribute" </I><br +> <B> IN BOLD </B><br> <A NAME="evil"> <A HREF="javascript:alert('die die die');"> HREF=JAVASCRIP +T <!> </A> <br> <A HREF="image/bigone.jpg" ONMOUSEOVER="alert('die die die +');"> <IMG SRC="image/smallone.jpg" ALT="ONMOUSEOVER JAVASCR +IPT"> </A> </A> <br> ]; + # print "#original text",$/, $it, $/; print "#scrubbed text (default ", $scrubber->default(), # no arguments returns the current value " comment ", $scrubber->comment(), " process ", $scrubber->process(), " )", $/, $scrubber->scrub($it), $/; + # $scrubber->default(1); # allow all tags by default $scrubber->comment(0); # deny comments + # print "#scrubbed text (default ", $scrubber->default(), " comment ", $scrubber->comment(), " process ", $scrubber->process(), " )", $/, $scrubber->scrub($it), $/; + # $scrubber->process(1); # allow process instructions (danger +ous) $default[0] = 1; # allow all tags by default $default[1]->{''} = 0; # deny all attributes by default $scrubber->default(@default); # set the default again + # print "#scrubbed text (default ", $scrubber->default(), " comment ", $scrubber->comment(), " process ", $scrubber->process(), " )", $/, $scrubber->scrub($it), $/; =for example end =cut package HTML::Scrubber; use HTML::Parser(); use HTML::Entities; use vars qw[ $VERSION ]; use strict; $VERSION = '0.01'; sub new { my $package = shift; my $p = HTML::Parser->new( api_version => 3, default_h => [\&_scrub, "self, event, tagname, attr, text"], ); my $self = { _p => $p, _rules => { '' => 0, }, _comment => 0, _process => 0, _r => "", }; $p->{"\0_s"} = bless $self, $package; return $self unless @_; my %args = @_; for my $f( qw[ default allow deny rules process comment ] ) { next unless exists $args{$f}; if( ref $args{$f} ) { $self->$f( @{ $args{$f} } ) ; } else { $self->$f( $args{$f} ) ; } } return $self; } sub _validate { my($t, $r, $a) = @_; return "<$t>" unless %$a; my %f; for my $k( keys %$a ) { if( exists $r->{$k} ) { if( ref $r->{$k} \|\| length($r->{$k}) > 1 ) { $f{$k} = $a->{$k} if $a->{$k} =~ m{$r->{$k}}; } elsif( $r->{$k} ) { $f{$k} = $a->{$k}; } } elsif( exists $r->{''} and $r->{''} ) { $f{$k} = $a->{$k}; } } return "<$t $r>" if $r = join ' ', map { qq[$_="] .encode_entities($f{$_}) .q["] } keys %f; return "<$t>"; } sub _scrub { my( $p, $e, $t, $a, $text ) = @_; my $s = $p->{"\0_s"} ; if ( $e eq 'start' ) { if( exists $s->{_rules}->{$t} ) { # is there a specific rule if( ref $s->{_rules}->{$t} ) { # is it complicated?(not si +mple;) $s->{_r} .= _validate($t, $s->{_rules}->{$t}, $a); } elsif( $s->{_rules}->{$t} ) { # validate using default attribute rule $s->{_r} .= _validate($t, $s->{_rules}->{'_'}, $a); } } elsif( $s->{_rules}->{''} ) { # default allow tags $s->{_r} .= _validate($t, $s->{_rules}->{'_'}, $a); } } elsif ( $e eq 'end' ) { if( exists $s->{_rules}->{$t} ) { $s->{_r} .= "</$t>" if $s->{_rules}->{$t}; } elsif( $s->{_rules}->{''} ) { $s->{_r} .= "</$t>"; } } elsif ( $e eq 'comment' ) { $s->{_r} .= $text if $s->{_comment}; } elsif ( $e eq 'process' ) { $s->{_r} .= $text if $s->{_process}; } elsif ( $e eq 'text' or $e eq 'default') { $s->{_r} .= $text; } elsif ( $e eq 'start_document' ) { $s->{_r} = ""; } } sub comment { return $_[0]->{_comment} if @_ == 1; $_[0]->{_comment} = $_[1]; return(); } sub process { return $_[0]->{_process} if @_ == 1; $_[0]->{_process} = $_[1]; return(); } sub allow { my $self = shift; $self->{_rules}{$_}=1 for @_; # $self->{_p}->report_tags(@_); # I can't do this one ;( return(); } sub deny { my $self = shift; $self->{_rules}{$_} = 0 for @_; # $self->{_p}->ignore_tags(@_); # optimization ;) return(); } sub rules{ my $self = shift; my %rules = @_; for my $k(keys %rules) { $self->{_rules}{$k} = $rules{$k}; } return(); } sub default { return $_[0]->{_rules}{''} if @_ == 1; $_[0]->{_rules}{''} = $_[1] if defined $_[1]; $_[0]->{_rules}{'_'} = $_[2] if defined $_[2]; return(); } sub scrub_file { $_[0]->{_p}->parse_file($_[1]); return delete $_[0]->{_r}; } sub scrub { $_[0]->{_p}->parse($_[1]); $_[0]->{_p}->eof(); return delete $_[0]->{_r}; } 1; #print sprintf q[ '%-12s => %s,], "$_'", $h{$_} for sort keys %h;# per +l! =head2 FUN If you don't know how to install this, try use ExtUtils::MakeMaker; WriteMakefile( 'NAME' => 'HTML::Scrubber', 'VERSION_FROM' => 'Scrubber.pm', # finds $VERSION ); If you have Test::Inline, try pod2test Scrubber.pm >scrubber.t perl scrubber.t =head1 CAVEATS The interface is subject to change until I release this on CPAN. The allow/deny/default methods are likely to go away, so I can implement some optimizations (see L<TODO\|"TODO"> for details) +. =head1 TODO Top priority, allow for redirecting output to a file \|\| filehandle instead of just returning a giant string. Currently, $p->ignore_tags( TAG, ... ) is used as an optimization. I need to figure out the semantics of using $p->report_tags( TAG, ... ) as an optimization option. See L<HTML::Parser\|HTML::Parser> if you're curious. I have no plans at all of using $p->ignore_elements( TAG, ... ) =head1 SEE ALSO L<HTML::Parser>, L<Test::Inline>, L<HTML::Sanitizer>. =head1 AUTHOR D.H aka PodMaster =for in_the_future Please use http://rt.cpan.org/ to report bugs. http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber =head1 LICENSE Copyright (c) 2003 by D.H. aka PodMaster. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut

Comment on HTML::Scrubber - Perl extension for scrubbing/sanitizing html Download Code

Replies are listed 'Best First'.

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by Hero Zzyzzx (Curate) on Apr 18, 2003 at 14:17 UTC

What does this do that HTML::TagFilter doesn't? HTML::Scrubber and HTML::TagFilter appear to attack the exact same problem with nearly the same interface.

-Any sufficiently advanced technology is
indistinguishable from doubletalk.

[reply]

Re: Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html

by PodMaster (Abbot) on Apr 18, 2003 at 14:40 UTC

I wasn't aware of HTML::TagFilter. Like I say in the description, I wasn't satisfied with HTML::Sanitizer, so I wrote HTML::Scrubber.

HTML::Scrubber and HTML::TagFilter appear to attack the exact same problem with nearly the same interface.

HTML::Sanitizer

Looking at TagFilter's TO DO, I already have tests and do the HTML::Parser level optimization he talks about (he's welcomed to borrow those bits if he can).

I don't intend to expand HTML::Scrubber beyond the interface you see now (like he does). I like my simple interface.

HTML::Scrubber is never going to do any rule based translations (like changing all font tags with size=2 to span class=bar, or whatever).

++Hero Zzyzzx, it's nice to see others play merlyn ;D

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

[reply]

Re: Re: Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html

by Anonymous Monk on Apr 19, 2003 at 00:58 UTC

Looking at TagFilter's TO DO, I already have tests and do the HTML::Parser level optimization he talks about (he's welcomed to borrow those bits if he can).

Great, instead of inventing yet another wheel, why not send him the tests and patch implementing those features?

[reply]

Re: Re: Re: Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html

by hossman (Prior) on Apr 19, 2003 at 03:37 UTC

Re: Re: Re: Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html

by PodMaster (Abbot) on Apr 19, 2003 at 01:26 UTC

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by simon.proctor (Vicar) on Apr 18, 2003 at 17:15 UTC

my $list = [
    'onabort',
    'onactivate',
    'onafterprint',
    'onafterupdate',
    'onbeforeactivate',
    'onbeforecopy',
    'onbeforecut',
    'onbeforedeactivate',
    'onbeforeeditfocus',
    'onbeforepaste',
    'onbeforeprint',
    'onbeforeunload',
    'onbeforeupdate',
    'onblur',
    'onbounce',
    'oncellchange',
    'onchange',
    'onclick',
    'oncontextmenu',
    'oncontrolselect',
    'oncopy',
    'oncut',
    'ondataavailable',
    'ondatasetchanged',
    'ondatasetcomplete',
    'ondblclick',
    'ondeactivate',
    'ondrag',
    'ondragend',
    'ondragenter',
    'ondragleave',
    'ondragover',
    'ondragstart',
    'ondrop',
    'onerror',
    'onerrorupdate',
    'onfilterchange',
    'onfinish',
    'onfocus',
    'onfocusin',
    'onfocusout',
    'onhelp',
    'onkeydown',
    'onkeypress',
    'onkeyup',
    'onlayoutcomplete',
    'onload',
    'onlosecapture',    
    'onmousedown',
    'onmouseenter',
    'onmouseleave',
    'onmousemove',
    'onmouseover',
    'onmouseout',
    'onmouseup',
    'onmousewheel',
    'onmove',
    'onmoveend',
    'onmovestart',
    'onpaste',
    'onpropertychange',
    'onreadystatechange',
    'onreset',
    'onresize',
    'onresizeend',
    'onresizestart',
    'onrowenter',
    'onrowexit',
    'onrowsdelete',
    'onrowsinserted',
    'onscroll',
    'onselect',
    'onselectionchange',
    'onselectstart',
    'onstart',
    'onstop',
    'onsubmit',
    'onunload'    
];
[download]

[reply]
[d/l]

Re^2: HTML::Scrubber - Perl extension for scrubbing/sanitizing html (turn "on" off)

by tye (Sage) on Apr 18, 2003 at 21:05 UTC

This screams for 'on*' => 0 or perhaps qr/^on/i => 0 support. (:

tye

[reply]
[d/l]
[select]

Re: Re^2: HTML::Scrubber - Perl extension for scrubbing/sanitizing html (turn "on" off)

by Anonymous Monk on Apr 19, 2003 at 09:54 UTC

Hi, could you explain this? I haven't seen the notation before. Does it mean instead of onunload etc, he would just have 'load' as the list entry?

[reply]

Re^4: HTML::Scrubber - Perl extension for scrubbing/sanitizing html (turn "on" off)

by tye (Sage) on Apr 20, 2003 at 17:06 UTC

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by Jenda (Abbot) on Apr 18, 2003 at 14:58 UTC

To add to the mess ... I have my own HTML filtering module HTML::JFilter. I considered uploading it to CPAN, but found HTML::TagFilter and wasn't sure I should do so then. The biggest difference of HTML::JFilter and these other two is that the HTML::JFilter is supposed to get the list of tags&attributes to allow as a single, easy to edit string. I need that so that the admins of our system are able to modify the filtering of different fields. (The system is used to distribute job offers to different sites, each site may allow different subset of HTML.)

I wanted to look at the code in HTML::TagFilter to see whether it would be viable to merge it with HTML::JFilter, but did not have time to do that yet.

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
-- Rick Osborne

Edit by castaway: Closed small tag in signature

[reply]

Re: Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html

by PodMaster (Abbot) on Apr 18, 2003 at 15:09 UTC

That reminds me, I forgot to add

marked_sections => 1 
attr_encoded    => 1
[download]

Not enough arguments for Test::skip at test.pl line 51, near ""You don't have Data::Compare\n")"

[reply]
[d/l]
[select]

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by mojotoad (Monsignor) on Oct 07, 2004 at 14:09 UTC

FYI, I think you must be using a windows editor. When you build the distro for your module on Solaris (2.8 in this case), the ^M characters at the end of every line appear to break the POD to man building process. Removing the trailing ^M characters beforehand fixes the problem. I must say I was suprised that this breaks the POD interface. Of course, this is also perl version 5.005_03, which means I'm probably using some woefully archaic version of the POD tools.

Since your the PodMaster and all, I thought you'd like to know!

Cheers,
Matt

[reply]
[d/l]
[select]


Your skill will accomplish what the force of many cannot
	PerlMonks