Tag filtering: a standard mechanism?

thpfft has asked for the wisdom of the Perl Monks concerning the following question:

I'm getting that 'look! I made a wheel!' feeling again. I've just put this in a script, not for the first time:

my $text = "<p>" . join('</p><p>', split(/[\r\n]{2,}/,$self->body_text())) . "</p>";

to create paragraphs in the body text of records coming out of a database, of course. Which everyone has to do. And then I'm going to start thinking about links, and then subheadings, and soon I'm going to find myself inventing a simple markup language that my users will be able to handle, and we all know that's a Sin.

I've dug around on CPAN and here looking for something that will help me to format text nicely but still protect users from themselves and one another, and I've turned up very little. There's a lot of discussion, but not much code.

What I really need, and I assume the 47 million other people who write content-management systems or bulletin boards also need, is a standard and reasonably slim way of filtering input and output so that an arbitrary set of HTML tags is allowed through but everything else is removed. If it can detect non-marked-up text and act appropriately, that's a bonus.

HTML::FromText does a good job of deducing markup from text formatting, but it doesn't handle headings very flexibly, or anything to do with document structure or linking. HTML::Filter - which is deprecated anyway - removes selected content between tags rather than the tags themselves. HTML::Parser is overkill here. HTML::SimpleParse is closest, but still at a bit of a tangent and rather overqualified for the job.

It's not hard to see how I could do this to my own requirements. Everything and Slashcode do it essentially the same way - a two step regex-based process of stripping out all but the allowed tags and stripping out all but the allowed attributes from those that remain.
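
For illustration only, here is a minimal sketch of that two-step idea in plain Perl. The %allowed whitelist, the filter_tags and keep_attrs names, and the href scheme check are assumptions made up for this example; none of it is taken from Everything or Slashcode, and a regex filter like this is not watertight (an attribute value containing '>' will confuse it, and only double-quoted attribute values are recognised).

```perl
use strict;
use warnings;

# Hypothetical whitelist: tag name => list of attributes allowed on it.
my %allowed = (
    p  => [],
    br => [],
    h2 => [],
    a  => ['href'],
);

# Rebuild the attribute string of one tag, keeping only whitelisted,
# double-quoted attributes. Everything else is silently dropped.
sub keep_attrs {
    my ($name, $rest) = @_;
    my %ok  = map { $_ => 1 } @{ $allowed{$name} };
    my $out = '';
    while ($rest =~ /([a-zA-Z][\w-]*)\s*=\s*"([^"]*)"/g) {
        my ($attr, $value) = (lc $1, $2);
        next unless $ok{$attr};
        # Assumed policy: links must be http(s) or site-relative.
        next if $attr eq 'href' && $value !~ m{^(?:https?://|/)}i;
        $out .= qq{ $attr="$value"};
    }
    return $out;
}

sub filter_tags {
    my ($html) = @_;

    # Drop comments first so markup hidden inside them cannot survive.
    $html =~ s/<!--.*?-->//gs;

    # Step 1: remove every tag whose name is not on the whitelist.
    $html =~ s{(</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>)}{ exists $allowed{lc $2} ? $1 : '' }ge;

    # Step 2: rewrite each surviving opening tag with only its allowed attributes.
    $html =~ s{<([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>}{
        my ($name, $rest) = (lc $1, $2);
        "<$name" . keep_attrs($name, $rest) . ">";
    }ge;

    return $html;
}

print filter_tags('<p onclick="x()">Hi <b>there</b> <a href="http://example.com/" onclick="evil()">link</a></p>'), "\n";
# prints: <p>Hi there <a href="http://example.com/">link</a></p>
```

Note that this removes the tags but keeps the text between them, so the body of a stripped script element would still come through as plain text.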

But this seems like a simple thing that ought to exist, and things like that usually turn out to be on CPAN somewhere. So, er, does it exist? Is anyone working on it? Should I try to write it? Is it possible to make it watertight? Should I just use HTML::Parser and shut up?

For building it my first choice would be to try subclassing or developing from HTML::SimpleParse, but it would have to be configurable from the outside, and very tolerant of broken markup, or at least able to return a useful complaint. Second choice would be to encapsulate the Everything approach.

Any comments please? A slap and a 'use HTML::Foo' would make me very happy...


Replies are listed 'Best First'.
Re: Tag filtering: a standard mechanism? by tachyon (Chancellor) on Sep 13, 2001 at 07:08 UTC
    Should I just use HTML::Parser and shut up?

Yes! Here is a filter example to get you going - it's really quite easy once you get your head around how it works. I find the pod a little obscure but there are some good tutorials out there. You should easily see how we check each opening and closing tag and add it if it is on the OK list - the parser calls &start for opening tags and &end for closing tags. Similarly, the parser calls &text for the text between tags, and we add that text if we have flagged that we want it. If you just want the text, just don't add the tags. What could be easier?

    #!/usr/bin/perl -w
    package Filter;
    use strict;
    use base 'HTML::Parser';

    my ($filter, $want_it);
    my @ok_tags = qw ( h1 h2 h3 h4 p br );
    my %ok_tags;
    $ok_tags{$_}++ for @ok_tags;

    sub start {
        my ($self, $tag, $attr, $attrseq, $origtext) = @_;
        if ( exists $ok_tags{$tag} ) {
            $filter .= $origtext;
            $want_it = 1;
        }
        else {
            $want_it = 0;
        }
    }

    sub text {
        my ($self, $text) = @_;
        $filter .= $text if $want_it;
    }

    sub comment {
        # uncomment to keep comments instead of stripping them
        # my ($self, $comment) = @_;
        # $filter .= "<!-- $comment -->";
    }

    sub end {
        my ($self, $tag, $origtext) = @_;
        $filter .= $origtext if exists $ok_tags{$tag};
    }

    my $parser = new Filter;
    my $html = join '', <DATA>;
    $parser->parse($html);
    $parser->eof;
    print $html;
    print "\n\n------------------------\n\n";
    print $filter;

    __DATA__
    <html>
    <head>
    <title>Title</title>
    </head>
    <body>
    <h1>Hello Parser</h1>
    <p>You need HTML::Parser</p>
    <h2>Parser rocks!</h2>
    <a href="html.parser.com">html.parser.com</a>
    <hr>
    <pre>
    use HTML::Parser;
    </pre>
    <!-- HTML PARSER ROCKS! -->
    </body>
    </html>

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Re: Tag filtering: a standard mechanism? by thpfft (Chaplain) on Sep 13, 2001 at 18:59 UTC
Thank you. Unfortunately, it'll have to be a bit more complicated than that because I want to screen attributes and their values as well as tags, but I've got most of it done. But there are two things I don't quite understand, and the docs are every bit as opaque as you suggested. I hope you can help:

update: both issues fixed by subclassing properly. All working now. Thanks for your help.

1. Should $filter and $want_it really be class variables? It feels like they should be tied to the $parser, at the very least, or better passed back directly from the methods. But I think I haven't quite grasped the event-driven model :(

2. Just clarification: HTML::Parser modifies text in place, it seems. Is there some way to stop it from printing upon eof? There must be, but I can't find it...

I'm trying to fold this into an existing system that uses TT2 and a subclass of Class::DBI, so I really need it to cooperate with TT's lovely lazy method calls, you see.

When it's done, should I post it here, or is this a rite of passage that everyone else has already passed through?
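
For anyone following along, here is a rough sketch of what tying the state to the parser object and screening attributes could look like. This is not the code referred to in the update above. The TagFilter package, the %allowed table, the tf_out and tf_skip hash keys, and the href rule are all invented for the example, and it assumes HTML::Parser's version 3 handler interface (api_version => 3 with start_h, end_h and text_h). The filtered text is returned from a method rather than printed; in the example above the printing is done by the script's own print statements.

```perl
package TagFilter;
use strict;
use warnings;
use base 'HTML::Parser';

# Hypothetical whitelist: tag => { allowed attribute => 1 }.
my %allowed = (
    p  => {},
    h1 => {},
    h2 => {},
    a  => { href => 1 },
);

sub new {
    my ($class) = @_;
    my $self = $class->SUPER::new(
        api_version => 3,
        start_h     => [ \&_start, 'self,tagname,attr,attrseq' ],
        end_h       => [ \&_end,   'self,tagname' ],
        text_h      => [ \&_text,  'self,text' ],
    );
    $self->{tf_out} = '';    # per-object output buffer, not a file-scoped variable
    return $self;
}

sub _start {
    my ($self, $tag, $attr, $attrseq) = @_;
    # Contents of script and style elements are dropped entirely.
    if ($tag eq 'script' || $tag eq 'style') { $self->{tf_skip}++; return }
    my $ok = $allowed{$tag} or return;
    my $out = "<$tag";
    for my $name (@$attrseq) {
        next unless $ok->{$name};
        my $value = $attr->{$name};
        # Assumed policy: links must be http(s) or site-relative.
        next if $name eq 'href' && $value !~ m{^(?:https?://|/)}i;
        # Attribute values arrive entity-decoded, so re-encode them.
        $value =~ s/&/&amp;/g;
        $value =~ s/"/&quot;/g;
        $value =~ s/</&lt;/g;
        $out .= qq{ $name="$value"};
    }
    $self->{tf_out} .= "$out>";
}

sub _end {
    my ($self, $tag) = @_;
    if ($tag eq 'script' || $tag eq 'style') {
        $self->{tf_skip}-- if $self->{tf_skip};
        return;
    }
    $self->{tf_out} .= "</$tag>" if $allowed{$tag};
}

sub _text {
    my ($self, $text) = @_;
    $self->{tf_out} .= $text unless $self->{tf_skip};
}

# Convenience class method: parse one string and return the filtered result.
sub filter {
    my ($class, $html) = @_;
    my $parser = $class->new;
    $parser->parse($html);
    $parser->eof;
    return $parser->{tf_out};
}

package main;

my $clean = TagFilter->filter('<p class="x">Hi <a href="http://example.com/" onclick="e()">link</a></p>');
```

Because filter() just returns a string, it should be callable from a template or an accessor method without side effects.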
Re: Tag filtering: a standard mechanism? by andreychek (Parson) on Sep 13, 2001 at 07:13 UTC
Hmm... would a templating system do what you are after? Taking an example from the HTML::Template documentation, perhaps you could try something like this:

In the template:

    <TMPL_LOOP NAME=EMPLOYEE_INFO>
       Name: <TMPL_VAR NAME=NAME> <P>
       Job: <TMPL_VAR NAME=JOB> <P>
      <P>
    </TMPL_LOOP>

In the script:

    $template->param(EMPLOYEE_INFO => [
        { name => 'Sam', job => 'programmer' },
        { name => 'Steve', job => 'soda jerk' },
    ]);
    print $template->output();

The output:

    Name: Sam <P>
    Job: programmer <P>
    <P>
    Name: Steve <P>
    Job: soda jerk <P>
    <P>

Again, that code is from the documentation, I just cut and pasted it. With that though, you can just create some sort of header and footer, and use HTML::Template's TMPL_INCLUDE directive to pull in more HTML content as necessary. This is just an example though, and there are plenty of templating systems available in Perl if this doesn't suit your needs.

HTH!
-Eric
Re: Re: Tag filtering: a standard mechanism? by thpfft (Chaplain) on Sep 13, 2001 at 13:59 UTC
Thanks, but that's not the same thing. A templating system is fine for arranging the output - I'm using TT2 - but it doesn't have anything to do with formatting the contents of individual fields. What I want to find or create is a minimal system for excluding all but a specified set of HTML tags and tag attributes from within a piece of text.
Re: Tag filtering: a standard mechanism? by projekt21 (Friar) on Sep 13, 2001 at 12:29 UTC
You might have a look at UBB-Code. It is explained on http://ubbforums.infopop.com/cgi-bin/ultimatebb.cgi?ubb=ubb_code_page and uses a small set of tags to format input. I'm not sure if this can be considered a standard markup for what you need, but it looks pretty simple and a lot of sites use it.

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste
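
To give a flavour of how little code a bracket-tag scheme needs, here is a small sketch in plain Perl. The subset handled ([b], [i] and two forms of [url]) and the function names ubb_to_html and escape_html are assumptions for this example and are not taken from the page linked above. The key design choice is that all HTML in the input is escaped first, so only markup generated by the converter reaches the page.

```perl
use strict;
use warnings;

# Escape everything that could be interpreted as HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Convert an assumed subset of bracket tags to HTML.
sub ubb_to_html {
    my ($text) = @_;
    my $html = escape_html($text);

    # [b]...[/b] and [i]...[/i]
    $html =~ s{\[(/?)(b|i)\]}{<$1\L$2\E>}gi;

    # [url=http://...]label[/url]
    $html =~ s{\[url=(https?://[^\]\s]+)\](.*?)\[/url\]}{<a href="$1">$2</a>}gis;

    # [url]http://...[/url]
    $html =~ s{\[url\](https?://[^\[\s]+)\[/url\]}{<a href="$1">$1</a>}gis;

    return $html;
}

print ubb_to_html('[b]Hi[/b] <script> [url=http://example.com/]site[/url]'), "\n";
# prints: <b>Hi</b> &lt;script&gt; <a href="http://example.com/">site</a>
```

Unbalanced tags such as a lone [b] are passed through as an unclosed element here; a real implementation would want to pair them up.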

