Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

HTML::Scrubber - Perl extension for scrubbing/sanitizing html

by PodMaster (Abbot)
on Apr 18, 2003 at 10:26 UTC ( [id://251427]=sourcecode: print w/replies, xml ) Need Help??
Category: HTML Utility
Author/Contact Info /msg podmaster
Description:
use HTML::Scrubber; my $html = q[ <B> bold </B> <i> italic </i> <u> underlined </u> ]; my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] ); print $scrubber->scrub($html); $scrubber->deny( qw[ p b i u hr br ] ); print $scrubber->scrub($html); __END__
bold italic underlined bold italic underlined

Grab a vanilla tarball here.

update:
I've done the TODO, and uploaded HTML-Scrubber-0.02.tar.gz to CPAN. Version 0.01 kept here for historical purposes.

=head1 NAME

HTML::Scrubber - Perl extension for scrubbing/sanitizing html

=head1 SYNOPSIS

=for example begin

    #!/usr/bin/perl -w
    use HTML::Scrubber;
    use strict;
                                                                      
+      #
    my $html = q[
        <HR>                                                          
+      #
        <B> bold                                                      
+      #
            <U> underlined                                            
+      #
                <I>                                                   
+      #
                    <A href=#>  LINK    </A>                          
+      #
                </I>                                                  
+      #
            </U>                                                      
+      #
        </B>                                                          
+      #
        </HR>                                                         
+      #
    ];
                                                                      
+      #
    my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ]
+ ] );
                                                                      
+      #
    print $scrubber->scrub($html);
                                                                      
+      #
    $scrubber->deny( qw[ p b i u hr br ] );
                                                                      
+      #
    print $scrubber->scrub($html);
                                                                      
+      #

=for example end

=head1 DESCRIPTION

If you wanna "scrubbing"/"sanitize" html input, this be the modulehtml

I wasn't satisfied with HTML::Sanitizer because it is
based on HTML::TreeBuilder,
so I thought I'd write something similar
that works directly with HTML::Parser.

=head1 How does it work?

First a note on documentation: just study the example below.
It's all the documentation you could need.
If you're new to perl, good luck to you.

When a tag is encountered, HTML::Scrubber
allows/denies the tag using the explicit rule if one exists.

If no explicit rule exists, Scrubber applies the default rule.

If an explicit rule exists,
but it's a simple rule(0 or 1),
the default attribute rule is applied.

=head2 EXAMPLE

=for example begin

    #!/usr/bin/perl -w
    use HTML::Scrubber;
    use strict;
                                                                      
+      #
    my @allow = qw[ br hr b a ];
                                                                      
+      #
    my @rules = (
        script => 0,
        img => {
            src => qr{^(?!http://)}i, # only relative image links allo
+wed
            alt => 1,
            '*' => 0,
        },
    );
                                                                      
+      #
    my @default = (
        0   =>    # default rule, deny all tags
        {
            '*'           => 1, # default rule, allow all attributes
            'href'        => qr{^(?!(?:java)?script)}i,
            'src'         => qr{^(?!(?:java)?script)}i,
    #   If your perl doesn't have qr
    #   just use a string with length greater than 1
            'cite'        => '(?i-xsm:^(?!(?:java)?script))',
            'language'    => 0,
            'name'        => 1, # could be sneaky, but hey ;)
            'onblur'      => 0,
            'onchange'    => 0,
            'onclick'     => 0,
            'ondblclick'  => 0,
            'onerror'     => 0,
            'onfocus'     => 0,
            'onkeydown'   => 0,
            'onkeypress'  => 0,
            'onkeyup'     => 0,
            'onload'      => 0,
            'onmousedown' => 0,
            'onmousemove' => 0,
            'onmouseout'  => 0,
            'onmouseover' => 0,
            'onmouseup'   => 0,
            'onreset'     => 0,
            'onselect'    => 0,
            'onsubmit'    => 0,
            'onunload'    => 0,
            'src'         => 0,
            'type'        => 0,
        }
    );
                                                                      
+      #
    my $scrubber = HTML::Scrubber->new();
    $scrubber->allow( @allow );
    $scrubber->rules( @rules ); # key/value pairs
    $scrubber->default( @default );
    $scrubber->comment(1); # 1 allow, 0 deny
                                                                      
+      #
    ## preferred way to create the same object
    $scrubber = HTML::Scrubber->new(
        allow   => \@allow,
        rules   => \@rules,
        default => \@default,
        comment => 1,
        process => 0,
    );
                                                                      
+      #
    require Data::Dumper,die Data::Dumper::Dumper($scrubber) if @ARGV;
                                                                      
+      #
    my $it = q[
        <?php   echo(" EVIL EVIL EVIL "); ?>    <!-- asdf -->
        <hr>
        <I FAKE="attribute" > IN ITALICS WITH FAKE="attribute" </I><br
+>
        <B> IN BOLD </B><br>
        <A NAME="evil">
            <A HREF="javascript:alert('die die die');"> HREF=JAVASCRIP
+T &lt;!&gt; </A>
            <br>
            <A HREF="image/bigone.jpg" ONMOUSEOVER="alert('die die die
+');"> 
                <IMG SRC="image/smallone.jpg" ALT="ONMOUSEOVER JAVASCR
+IPT">
            </A>
        </A> <br> 
    ];
                                                                      
+      #
    print "#original text",$/, $it, $/;
    print
        "#scrubbed text (default ",
        $scrubber->default(), # no arguments returns the current value
        " comment ",
        $scrubber->comment(),
        " process ",
        $scrubber->process(),
        " )",
        $/,
        $scrubber->scrub($it),
        $/;
                                                                      
+      #
    $scrubber->default(1); # allow all tags by default
    $scrubber->comment(0); # deny comments
                                                                      
+      #
    print
        "#scrubbed text (default ",
        $scrubber->default(),
        " comment ",
        $scrubber->comment(),
        " process ",
        $scrubber->process(),
        " )",
        $/,
        $scrubber->scrub($it),
        $/;
                                                                      
+      #
    $scrubber->process(1);        # allow process instructions (danger
+ous)
    $default[0] = 1;              # allow all tags by default
    $default[1]->{'*'} = 0;       # deny all attributes by default
    $scrubber->default(@default); # set the default again
                                                                      
+      #
    print
        "#scrubbed text (default ",
        $scrubber->default(),
        " comment ",
        $scrubber->comment(),
        " process ",
        $scrubber->process(),
        " )",
        $/,
        $scrubber->scrub($it),
        $/;

=for example end

=cut

package HTML::Scrubber;
use HTML::Parser();
use HTML::Entities;
use vars qw[ $VERSION ];
use strict;

$VERSION = '0.01';

sub new {
    my $package = shift;
    my $p = HTML::Parser->new(
        api_version => 3,
        default_h => [\&_scrub, "self, event, tagname, attr, text"],
    );

    my $self = {
        _p => $p,
        _rules => {
            '*' => 0,
        },
        _comment => 0,
        _process => 0,
        _r => "",
    };

    $p->{"\0_s"} = bless $self, $package;

    return $self unless @_;

    my %args = @_;

    for my $f( qw[ default allow deny rules process comment ] ) {
        next unless exists $args{$f};
        if( ref $args{$f} ) {
            $self->$f( @{ $args{$f} } ) ;
        } else {
            $self->$f( $args{$f} ) ;
        }
    }

    return $self;
}

sub _validate {
    my($t, $r, $a) = @_;

    return "<$t>" unless %$a;

    my %f;

    for my $k( keys %$a ) {
        if( exists $r->{$k} ) {
            if( ref $r->{$k} || length($r->{$k}) > 1 ) {
                $f{$k} = $a->{$k} if $a->{$k} =~ m{$r->{$k}};
            } elsif( $r->{$k} ) {
                $f{$k} = $a->{$k};
            }
        } elsif( exists $r->{'*'} and $r->{'*'} ) {
            $f{$k} = $a->{$k};
        }
    }
    
    return "<$t $r>"
        if $r = join ' ',
                map {
                    qq[$_="]
                    .encode_entities($f{$_})
                    .q["]
                } keys %f;

    return "<$t>";
}

sub _scrub {
    my( $p, $e, $t, $a, $text ) = @_;
    my $s = $p->{"\0_s"} ;

    if ( $e eq 'start' ) {
        if( exists $s->{_rules}->{$t} ) {  # is there a specific rule
            if( ref $s->{_rules}->{$t} ) { # is it complicated?(not si
+mple;)
                $s->{_r} .= _validate($t, $s->{_rules}->{$t}, $a);
            } elsif( $s->{_rules}->{$t} ) {
        # validate using default attribute rule
                $s->{_r} .= _validate($t, $s->{_rules}->{'_'}, $a);
            }
        } elsif( $s->{_rules}->{'*'} ) { # default allow tags
            $s->{_r} .= _validate($t, $s->{_rules}->{'_'}, $a);
        }
    } elsif ( $e eq 'end' ) {
        if( exists $s->{_rules}->{$t} ) {
            $s->{_r} .= "</$t>" if $s->{_rules}->{$t};
        } elsif( $s->{_rules}->{'*'} ) {
            $s->{_r} .= "</$t>";
        }
    } elsif ( $e eq 'comment' ) {
        $s->{_r} .= $text if $s->{_comment};
    } elsif ( $e eq 'process' ) {
        $s->{_r} .= $text if $s->{_process};
    } elsif ( $e eq 'text' or $e eq 'default') {
        $s->{_r} .= $text;
    } elsif ( $e eq 'start_document' ) {
        $s->{_r} = "";
    }
}

sub comment {
    return
        $_[0]->{_comment}
            if @_ == 1;
    $_[0]->{_comment} = $_[1];
    return();
}

sub process {
    return
        $_[0]->{_process}
            if @_ == 1;
    $_[0]->{_process} = $_[1];
    return();
}

sub allow {
    my $self = shift;
    $self->{_rules}{$_}=1 for @_;
#    $self->{_p}->report_tags(@_); # I can't do this one ;(
    return();
}

sub deny {
    my $self = shift;
    $self->{_rules}{$_} = 0 for @_;
#    $self->{_p}->ignore_tags(@_); # optimization ;)
    return();
}

sub rules{
    my $self = shift;
    my %rules = @_;
    for my $k(keys %rules) {
        $self->{_rules}{$k} = $rules{$k};
    }
    return();
}

sub default {
    return
        $_[0]->{_rules}{'*'}
            if @_ == 1;
    $_[0]->{_rules}{'*'} = $_[1] if defined $_[1];
    $_[0]->{_rules}{'_'} = $_[2] if defined $_[2];
    return();
}

sub scrub_file {
    $_[0]->{_p}->parse_file($_[1]);
    return delete $_[0]->{_r};
}

sub scrub {
    $_[0]->{_p}->parse($_[1]);
    $_[0]->{_p}->eof();
    return delete $_[0]->{_r};
}

1;

#print sprintf q[ '%-12s => %s,], "$_'", $h{$_} for sort keys %h;# per
+l!

=head2 FUN

If you don't know how to install this, try

    use ExtUtils::MakeMaker;
    WriteMakefile(
        'NAME'        => 'HTML::Scrubber',
        'VERSION_FROM'    => 'Scrubber.pm', # finds $VERSION
    );


If you have Test::Inline, try

    pod2test Scrubber.pm >scrubber.t
    perl scrubber.t

=head1 CAVEATS

The interface is subject to change until I release this on CPAN.

The allow/deny/default methods are likely to go away,
so I can implement some optimizations (see L<TODO|"TODO"> for details)
+.

=head1 TODO


Top priority, allow for redirecting output to a
file || filehandle instead of
just returning a giant string.


Currently, 

    $p->ignore_tags( TAG, ... )

is used as an optimization.
I need to figure out the semantics of using

    $p->report_tags( TAG, ... )

as an optimization option.
See L<HTML::Parser|HTML::Parser> if you're curious.


I have no plans at all of using
    $p->ignore_elements( TAG, ... )

=head1 SEE ALSO

L<HTML::Parser>, L<Test::Inline>, L<HTML::Sanitizer>.

=head1 AUTHOR

D.H aka PodMaster

=for in_the_future
Please use http://rt.cpan.org/ to report bugs.
http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber

=head1 LICENSE

Copyright (c) 2003 by D.H. aka PodMaster.
All rights reserved.

This module is free software;
you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut
Replies are listed 'Best First'.
Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by Hero Zzyzzx (Curate) on Apr 18, 2003 at 14:17 UTC

    What does this do that HTML::TagFilter doesn't? HTML::Scrubber and HTML::TagFilter appear to attack the exact same problem with nearly the same interface.

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

      Everything ;D

      I wasn't aware of HTML::TagFilter. Like I say in the description, I wasn't satisfied with HTML::Sanitizer, so I wrote HTML::Scrubber.

      HTML::Scrubber and HTML::TagFilter appear to attack the exact same problem with nearly the same interface.
      The same could be said of HTML::Sanitizer.

      Looking at TagFilter's TO DO, I already have tests and do the HTML::Parser level optimization he talks about (he's welcomed to borrow those bits if he can).

      I don't intend to expand HTML::Scrubber beyond the interface you see now (like he does). I like my simple interface.

      HTML::Scrubber is never going to do any rule based translations (like changing all font tags with size=2 to span class=bar, or whatever).

      ++Hero Zzyzzx, it's nice to see others play merlyn ;D


      MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
      I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
      ** The Third rule of perl club is a statement of fact: pod is sexy.

        Looking at TagFilter's TO DO, I already have tests and do the HTML::Parser level optimization he talks about (he's welcomed to borrow those bits if he can).

        Great, instead of inventing yet another wheel, why not send him the tests and patch implementing those features?

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by simon.proctor (Vicar) on Apr 18, 2003 at 17:15 UTC
    There is a load of IE only JavaScript events that you may want to add. I've written a suite based on similar lines but works on re-formatting and simplifying according to a rule set (sadly I cannot share the code). I can, however, share the events that I have:
    my $list = [ 'onabort', 'onactivate', 'onafterprint', 'onafterupdate', 'onbeforeactivate', 'onbeforecopy', 'onbeforecut', 'onbeforedeactivate', 'onbeforeeditfocus', 'onbeforepaste', 'onbeforeprint', 'onbeforeunload', 'onbeforeupdate', 'onblur', 'onbounce', 'oncellchange', 'onchange', 'onclick', 'oncontextmenu', 'oncontrolselect', 'oncopy', 'oncut', 'ondataavailable', 'ondatasetchanged', 'ondatasetcomplete', 'ondblclick', 'ondeactivate', 'ondrag', 'ondragend', 'ondragenter', 'ondragleave', 'ondragover', 'ondragstart', 'ondrop', 'onerror', 'onerrorupdate', 'onfilterchange', 'onfinish', 'onfocus', 'onfocusin', 'onfocusout', 'onhelp', 'onkeydown', 'onkeypress', 'onkeyup', 'onlayoutcomplete', 'onload', 'onlosecapture', 'onmousedown', 'onmouseenter', 'onmouseleave', 'onmousemove', 'onmouseover', 'onmouseout', 'onmouseup', 'onmousewheel', 'onmove', 'onmoveend', 'onmovestart', 'onpaste', 'onpropertychange', 'onreadystatechange', 'onreset', 'onresize', 'onresizeend', 'onresizestart', 'onrowenter', 'onrowexit', 'onrowsdelete', 'onrowsinserted', 'onscroll', 'onselect', 'onselectionchange', 'onselectstart', 'onstart', 'onstop', 'onsubmit', 'onunload' ];
    Some, like the mouse wheel, are IE 6 only but nm :)

    HTH

    SP

      This screams for 'on*' => 0 or perhaps qr/^on/i => 0 support. (:

                      - tye

        Hi, could you explain this? I haven't seen the notation before. Does it mean instead of onunload etc, he would just have 'load' as the list entry?

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by Jenda (Abbot) on Apr 18, 2003 at 14:58 UTC

    To add to the mess ... I have my own HTML filtering module HTML::JFilter. I considered uploading it to CPAN, but found HTML::TagFilter and wasn't sure I should do so then. The biggest difference of HTML::JFilter and these other two is that the HTML::JFilter is supposed to get the list of tags&attributes to allow as a single, easy to edit string. I need that so that the admins of our system are able to modify the filtering of different fields. (The system is used to distribute job offers to different sites, each site may allow different subset of HTML.)

    I wanted to look at the code in HTML::TagFilter to see whether it would be viable to merge it with HTML::JFilter, but did not have time to do that yet.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

    Edit by castaway: Closed small tag in signature

      Not bad. I like the spec idea, and think the HTML::TagFilter dude would like it as well. You could also easily subclass HTML::Scrubber and re-create HTML::JFilter.

      That reminds me, I forgot to add

      marked_sections => 1 attr_encoded => 1
      BTW -- XML-DTDParser-1.7, Not enough arguments for Test::skip at test.pl line 51, near ""You don't have Data::Compare\n")".


      MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
      I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
      ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: HTML::Scrubber - Perl extension for scrubbing/sanitizing html
by mojotoad (Monsignor) on Oct 07, 2004 at 14:09 UTC
    Thanks for the module. Nice, simple interface.

    FYI, I think you must be using a windows editor. When you build the distro for your module on Solaris (2.8 in this case), the ^M characters at the end of every line appear to break the POD to man building process. Removing the trailing ^M characters beforehand fixes the problem. I must say I was suprised that this breaks the POD interface. Of course, this is also perl version 5.005_03, which means I'm probably using some woefully archaic version of the POD tools.

    Since your the PodMaster and all, I thought you'd like to know!

    Cheers,
    Matt

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: sourcecode [id://251427]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (3)
As of 2024-04-19 19:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found