Dynamically cleaning up HTML fragments

SilasTheMonk has asked for the wisdom of the Perl Monks concerning the following question:

I need to be able to take a fragment of HTML and clean it up. We can assume that the outermost element is a "div", though if it was not necessary to assume that it would be a bonus. By "clean up" I mean

The fragment should be valid XML. In particular I know that my source is not closing img tags properly.
Any tags apart from: p, a, img, h3, div, em, strong should be stripped.
For any supported tag all but certain attributes must be stripped off.
Some attributes may require further processing such as removing non-local src or href attributes.
We should reject tags that do not have certain mandatory attributes. No "a" without an href for example.
It must either already be in Debian or just require perl packaging.
It must be configurable and if possible extendible.
I would rather not be starting a new open source project. I know this is an old problem.
Degenerate things like empty paragraphs should be removed.
I should be able to turn it into a Data::FormValidator::Filters though I cannot really see how something could meet all the other criteria and not this one.
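
To make the whitelist, attribute and mandatory-attribute rules above concrete, here is a minimal sketch in core Perl. It is not taken from any of the modules discussed in this thread: the %SPEC layout and the helper names is_local_uri and filter_attributes are hypothetical, and "local" is assumed to mean "no scheme and no protocol-relative // prefix".

```perl
use strict;
use warnings;

# Hypothetical whitelist: tag => attributes to keep and attributes that
# must survive filtering for the tag to be kept at all.
my %SPEC = (
    img => { allowed => [qw(src alt title width height)], required => [qw(src)] },
    a   => { allowed => [qw(href title)],                 required => [qw(href)] },
    map { ( $_ => { allowed => [], required => [] } ) } qw(p h3 div em strong),
);

# Assumption: a URL is "local" when it has no scheme ("http:", "mailto:",
# "javascript:") and does not start with a protocol-relative "//host".
sub is_local_uri {
    my ($uri) = @_;
    return 0 if !defined $uri || $uri eq '';
    return 0 if $uri =~ m{\A\s*[A-Za-z][A-Za-z0-9+.\-]*:};
    return 0 if $uri =~ m{\A\s*//};
    return 1;
}

# Returns a hashref of the surviving attributes, or undef when the tag
# should be dropped (unknown tag, or a mandatory attribute is missing
# after filtering).
sub filter_attributes {
    my ($tag, $attrs) = @_;
    my $rule = $SPEC{ lc $tag } or return;

    my %keep;
    for my $name (@{ $rule->{allowed} }) {
        next unless defined $attrs->{$name};
        next if $name =~ /\A(?:src|href)\z/ && !is_local_uri($attrs->{$name});
        $keep{$name} = $attrs->{$name};
    }
    for my $name (@{ $rule->{required} }) {
        return unless exists $keep{$name};
    }
    return \%keep;
}

my $kept = filter_attributes('img', { src => 'pic.jpg', onclick => 'x()' });
print join(',', sort keys %$kept), "\n";    # only "src" survives
```

A function like this only decides what to keep; it still needs a tolerant HTML parser in front of it to produce the tag and attribute events.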

Below is a list of things I have tried. Things mentioned in Simplify HTML programatically but not mentioned below I have consciously rejected for some reason. XML::SAX looks the most promising to me, but it is the one I have made the least progress with. Any ideas?

HTML::Scrubber

This module seemed at first to meet all the criteria -- until I spotted the issue with unclosed "img" tags. According to the bug reports there is also an issue with it not recognising self-closing tags; there is an easy workaround (rt://25477) for that. That workaround does not seem to help at all. I guess if all else fails I could use this module and apply another filter to fix the "img" tags, but this is ugly.
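
The "another filter" fallback could look something like the sketch below. It is a plain regex pass in core Perl, not part of HTML::Scrubber, and the name close_img_tags is made up for illustration. It assumes attribute values never contain a literal ">" and that no unquoted value ends in "/"; real-world HTML can break both assumptions, which is part of why this approach is ugly.

```perl
use strict;
use warnings;

# Rewrite every <img ...> or <img .../> as a self-closed <img ... />.
# Assumes no ">" inside attribute values; this is a sketch, not a parser.
sub close_img_tags {
    my ($html) = @_;
    $html =~ s{<img\b([^>]*?)\s*/?>}{<img$1 />}gi;
    return $html;
}

print close_img_tags('<div><img src="pic.jpg"><img src="b.png"/></div>'), "\n";
# prints: <div><img src="pic.jpg" /><img src="b.png" /></div>
```

It would run on the scrubber's output, so it only has to cope with the tags that survived the whitelist.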

HTML::Tidy

This is not itself in Debian but the underlying library is, so I could easily get it added. The perl library itself has appalling reviews. It looks to me as if it may not do everything I want, but combined with HTML::Scrubber maybe it would.

XML::SAX

This was my experiment of the day. From what I could read of the docs it looks as if it could do everything I want, though requiring some code to be written. However the docs are extremely light -- mostly pointing at Java documentation. I was experimenting with this as shown below. However once I introduced an attempt to close off open img tags it started giving an error:

mismatched tag at line 2, column 53, byte 141 at /usr/lib/perl5/XML/Parser.pm line 187

As you can see I am not experienced with SAX.

#!/usr/bin/perl 

use warnings;
use strict;
use Perl6::Slurp;

my $output = "";
use XML::SAX::Machines qw(Pipeline);
#use XML::SAX::ParserFactory;

my $machine = Pipeline(MySAXHandler => \$output);

$machine->parse_string( join "", slurp $ARGV[0] );
print "$output\n";

package MySAXHandler;
use base qw(XML::SAX::Base);


sub start_document {
    my $self = shift;
    $self->{_supported} = {
        img=>{
            alt=>1,
            width=>1,
            height=>1,
            src=>1,
            title=>1,
        },
        a=>{
            href=>1,
            title=>1,
        },
        p=>{},
        h3=>{},
        em=>{},
        strong=>{},
        div=>{},
    };
    return $self->SUPER::start_document(shift);
}

sub start_element {
    my ($self, $el) = @_;
    my $localName = $el->{LocalName};
    
    if (exists $self->{_pending_img}) {
        my %el = %{$self->{_pending_img}};
        delete $self->{_pending_img};
        delete $el{Attributes};
        $self->SUPER::end_element(\%el);
    }
    
    if (exists $self->{_supported}->{$localName}) {  
        
        my $attributes = $self->{_supported}->{$localName};
        foreach my $attr (keys %{$el->{Attributes}}) {
           my $key = $attr;
           $key =~ s[\A\{\}][]xms;    # strip the empty "{}" namespace prefix
           if (not exists $attributes->{$key}) {
               delete $el->{Attributes}->{$attr};
           }
        }
        
        if ($localName eq 'img') {
            $self->{_pending_img} = $el;
        }
        
        return $self->SUPER::start_element($el);
    }    
}

sub end_element {
    my ($self, $el) = @_;
    my $localName = $el->{LocalName};

    if (exists $self->{_pending_img} and $localName ne 'img') {
        my %el = %{$self->{_pending_img}};
        delete $self->{_pending_img};
        delete $el{Attributes};
        $self->SUPER::end_element(\%el);
    }

    if (exists $self->{_supported}->{$localName}) {  
        return $self->SUPER::end_element($el);
    }    
}

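# XML::SAX delivers text through the "characters" event (plural), so the
# handler must be named characters to be called at all.
sub characters {
    my ($self, $data) = @_;

    if (exists $self->{_pending_img}) {
        my %el = %{$self->{_pending_img}};
        delete $self->{_pending_img};
        delete $el{Attributes};
        $self->SUPER::end_element(\%el);
    }

    return $self->SUPER::characters($data);
}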

1

HTML::TreeBuilder

This does quite a good job of closing off the "img" tag but it does no cleaning. It also puts in "html" and "body" tags which actually I don't want but can at least be easily cleaned off. I have tried combining it with HTML::Scrubber but that just demonstrates the issues with self closing tags.

#!/usr/bin/perl

use strict;
use warnings;
use Carp;

use HTML::TreeBuilder;
use HTML::Scrubber;
use Perl6::Slurp;

my $tidy = HTML::TreeBuilder->new();
my $scrubber = HTML::Scrubber->new(
    allow => [ qw[ p em strong a img ] ],
    rules => [
        img => {
            src => 1,
            alt => 1,
            title => 1,
            width => 1,
            height => 1,
        },
        a => {
            href=>1,
            title=>1,
        },
    ],
);
$scrubber->{_p}->empty_element_tags(1);

my $html = slurp $ARGV[0];

$tidy->no_expand_entities(1);
$tidy->p_strict(1);
print $scrubber->scrub($tidy->parse_content($html)->as_XML);

Edit:


Replies are listed 'Best First'.
Re: Dynamically cleaning up HTML fragments by wfsp (Abbot) on Sep 24, 2010 at 08:12 UTC
I highly recommend having a look at Dave Raggett's HTML Tidy. I've found it to be a very nifty bit of kit for these types of jobs. Careful tweaking of the config would, I believe, achieve many of the tasks you are looking at.
Re^2: Dynamically cleaning up HTML fragments by SilasTheMonk (Chaplain) on Sep 24, 2010 at 11:41 UTC
Actually HTML::Tidy seems to have a bit of bad history at Debian. My original statement that it is not in Debian is wrong, but it's definitely in an odd state. I am investigating.
Re^3: Dynamically cleaning up HTML fragments by wfsp (Abbot) on Sep 25, 2010 at 10:32 UTC
Ubuntu 8.04, perl 5.10.1

HTML::Tidy has been released three times this year (the last on 17 September) so some of the criticisms may have been addressed. It requires tidyp (version 1.04 recently released) which is a fork of tidy. I was able to install tidyp in the usual way and H::T installed without fuss using cpanp.

#! /usr/bin/perl
use strict;
use warnings;

use HTML::Tidy;

my $tidy = HTML::Tidy->new(
    {
        output_xhtml       => 1,
        tidy_mark          => 0,
        markup             => 1,
        q{show-body-only}  => 1,
    }
);

printf qq{tidyp: %s\n}, $tidy->tidyp_version;
printf qq{libtidyp: %s\n}, $tidy->libtidyp_version;
printf qq{HTML::Tidy: %s\n}, $HTML::Tidy::VERSION;

my $html = do {local $/;<DATA>};

$tidy->parse(q{test.html}, $html) or die q{parse failed};

for my $message ($tidy->messages){
    print $message->as_string, qq{\n};
}

my $xhtml = $tidy->clean($html);
print $xhtml;

__DATA__
<div>
<p>tidy</p>
<img src="pic.jpg">
</div>

Output:

tidyp: 1.04
libtidyp: 1.04
HTML::Tidy: 1.54
test.html (1:1) Warning: missing <!DOCTYPE> declaration
test.html (1:1) Warning: inserting implicit <body>
test.html (1:1) Warning: inserting missing 'title' element
test.html (3:3) Warning: <img> lacks "alt" attribute
<div>
<p>tidy</p>
<img src="pic.jpg" /></div>

See the tidy quick reference for all the configuration options.
Re^4: Dynamically cleaning up HTML fragments by SilasTheMonk (Chaplain) on Sep 25, 2010 at 20:43 UTC
Re^2: Dynamically cleaning up HTML fragments by petdance (Parson) on Sep 26, 2010 at 04:26 UTC
tidyp is a fork of Dave's tidy, because the people who maintain tidy do not do releases. Without releases, it is all but impossible to build HTML::Tidy atop of it. xoxo, Andy
Re: Dynamically cleaning up HTML fragments by halfcountplus (Hermit) on Sep 24, 2010 at 00:13 UTC
>>We should reject tags that do not have certain mandatory attributes. No "a" without an href for example.

Local anchors do not have "href" as an attribute, they have a "name", eg <a name="local page anchor">here</a> ;)

Are you familiar with the event driven HTML::Parser? I have not used it to clean invalid source, but here's an idea (eg): when your start tag handler hits an img tag, put the entire tag text into an otherwise null global. In both the end and start tag handler, you check this global for content; if the img is not closed add / to it.

Alternately, I can tell you for a fact that HTML::Parser treats / used XHTMLishly (ie, not the first character in the tag) as an attribute. Therefore, with tags like image you can check for the / attribute and if not present, do the edit.

The rest of your requirements -- stripping certain tags, working with attributes, checking for text inside p tags, etc. -- can also easily be accomplished via HTML::Parser, but you will have to write some code to do it.
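
A rough sketch of that HTML::Parser idea follows. It is only an illustration: the %ALLOWED table and the names escape_attr and clean_fragment are invented here, and instead of tracking a pending img in a global it simply always writes img as a self-closed tag and ignores any end tag for it. Because attributes are copied from a whitelist, the "/" pseudo-attribute is dropped automatically. It does not handle mandatory attributes, empty paragraphs, or the text content of stripped script and style elements, which would pass through as plain text.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical whitelist: tag name => attributes to keep.
my %ALLOWED = (
    img => [qw(src alt title width height)],
    a   => [qw(href title)],
    map { ( $_ => [] ) } qw(p h3 div em strong),
);

# Attribute values arrive entity-decoded, so re-escape them on output.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

sub clean_fragment {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;
                $tag =~ s{/\z}{};    # "<img/>" is reported as tagname "img/"
                my $keep = $ALLOWED{$tag} or return;
                my $attrs = join '',
                    map  { sprintf ' %s="%s"', $_, escape_attr($attr->{$_}) }
                    grep { defined $attr->{$_} } @$keep;
                # img is always written self-closed, whatever the source did.
                $out .= $tag eq 'img' ? "<img$attrs />" : "<$tag$attrs>";
            },
            'tagname, attr',
        ],
        end_h => [
            sub {
                my ($tag) = @_;
                $out .= "</$tag>" if $ALLOWED{$tag} && $tag ne 'img';
            },
            'tagname',
        ],
        # 'text' is the raw source text, so existing entities pass through.
        text_h => [ sub { $out .= $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print clean_fragment('<div><p onclick="x()">hi <b>there</b></p><img src="pic.jpg" border="0"></div>'), "\n";
```

Unbalanced input stays unbalanced with this approach, since the handlers only filter events and do not repair nesting.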
Re^2: Dynamically cleaning up HTML fragments by Anonymous Monk on Sep 24, 2010 at 01:04 UTC
Fragment identifiers should be id attributes, which can go on any element and are unique within the document, instead of name attributes, which need not be unique and only belong on certain elements.
Re^3: Dynamically cleaning up HTML fragments by SilasTheMonk (Chaplain) on Sep 24, 2010 at 15:23 UTC
Of course I have. However this is completely irrelevant to the question. The HTML fragments that I have in mind have no need of either "name" or "id" attributes. Of course someone else with the same general question but different specifics may require them which is why configurability is one of the criteria.
Re: Dynamically cleaning up HTML fragments by bellaire (Hermit) on Sep 24, 2010 at 01:04 UTC
I have not used it extensively, but another module that looks really neat for parsing and "tidying" HTML is Marpa-HTML. Their html_fmt demo does handling of missing start and end tags, and the dist's documentation talks about being able to selectively eliminate certain types of tag.
Re: Dynamically cleaning up HTML fragments by trwww (Priest) on Sep 24, 2010 at 07:34 UTC
I think the default driver for SAX checks for well-formed-ness of the input before forwarding the events. I provided an example for someone looking to do something similar back in Re: Simplify HTML programatically. There you can see how to start your pipeline using the HTML driver.
Re^2: Dynamically cleaning up HTML fragments by SilasTheMonk (Chaplain) on Sep 24, 2010 at 09:39 UTC
Yes, I looked at that node, as you can see from my original post. I tried your example, which worked as far as it went. However I could not see what the HTML driver actually contributed, and since it depends on HTML::TreeBuilder all the advantages of using a SAX parser are undermined. My example (which also worked up to a point) did not have a dependency. My example only broke down when I tried to address the closing "img" issue. If you know how to fix that I would really appreciate it.
Re: Dynamically cleaning up HTML fragments by dHarry (Abbot) on Sep 24, 2010 at 11:30 UTC
Although I am biased towards XML solutions, in this particular case I would choose a different approach. Most if not all (compliant) parsers need an XML document to be at least well-formed in order to parse it correctly. (As you have discovered yourself with your SAX example.) As your XML is not well-formed the "XML approach" doesn't make a lot of sense to me. I like some of the other suggestions like html tidy.
Re: Dynamically cleaning up HTML fragments by petdance (Parson) on Sep 26, 2010 at 04:28 UTC
Those appalling reviews are one of the big problems with the CPAN review system. They were well-deserved, because there were horrible problems with building libtidy and HTML::Tidy atop of it. Now that I have forked tidy to tidyp, HTML::Tidy should build just fine. Alas, those reviews are still there, telling people not to use HTML::Tidy. :-( xoxo, Andy
Re: Dynamically cleaning up HTML fragments by clinton (Priest) on Sep 25, 2010 at 19:22 UTC
Glad to see that you have noticed HTML::StripScripts::Parser. I'm the maintainer, but not the guy who did the great work of writing it originally. It fulfils all of your listed requirements, and is certainly seeing active usage on our production sites. This code should do what you need (untested):

my $s = HTML::StripScripts::Parser->new({
    Context => 'Flow',

    # Only allow these tags
    BanAllBut => [qw(p a img h3 div em)],

    # Allow src and href
    AllowSrc  => 1,
    AllowHref => 1,

    Rules => {
        # remove empty p tags
        p => sub { return length $_[1]->{content} },

        # a must have a local href
        a => {
            href => \&strip_abs_uri,
            tag  => sub { return 0 unless $_[1]->{href} },
        },

        # img must have a local src
        img => {
            src => \&strip_abs_uri,
            tag => sub { return 0 unless $_[1]->{src} },
        },

        # Allow id and class for all tags
        '*' => {
            id    => 1,
            class => 1,
        }
    },
});

sub strip_abs_uri {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return 1 unless $attr_name =~ /href|src/;
    return $attr_val =~ m{://};
}

print $s->filter_html($html);
Re^2: Dynamically cleaning up HTML fragments by SilasTheMonk (Chaplain) on Sep 25, 2010 at 20:57 UTC
Thanks. This module really is working for me. In fact it is the ONLY module that meets my requirements. HTML::Restrict might work but it uses Moose. Actually I want "title" attributes on anchors and I did not like the handling of stripped code, so I had to subclass and add a few method overrides. But altogether it is pretty easy to use. I am building up some test cases and adding in Benchmark'ing. It looks like writing an HTML::Parser subclass might be the only alternative (and faster) but requiring some skill. Have you thought of writing a module that takes an HTML::StripScripts spec and "compiles" it to a faster, slimmer direct subclass of HTML::Parser?
Re^3: Dynamically cleaning up HTML fragments by clinton (Priest) on Sep 25, 2010 at 21:08 UTC
Glad it is working for you. I really do not recommend writing your own HTML::Parser subclass. If you look at the source of HTML::StripScripts you will see that there is a lot going on there, and with good reason. If you write your own subclass, and you're not willing to spend the time checking every last detail, then you are likely to miss a whole lot of corner cases that HSS already deals with. Parsing HTML is a hard job, and even harder when you're trying to make sense of bad HTML. (Again, I write as the fortunate maintainer, and not as the original author who did all the painstaking work.) clint