Should I just use HTML::Parser and shut up?

Yes!

Here is a filter example to get you going. It's really quite easy once you get your head around how it works. I find the POD a little obscure, but there are some good tutorials out there.

You should easily see how we check each opening and closing tag and add it if it is on the OK list: the parser calls &start for opening tags and &end for closing tags. Similarly, the parser calls &text for the text in between, and we add that text only if the flag set in &start says we want it. If you just want the text, simply don't add the tags. What could be easier?

#!/usr/bin/perl -w

package Filter;
use strict;
use base 'HTML::Parser';

my ($filter, $want_it);
my @ok_tags = qw ( h1 h2 h3 h4 p br );
my %ok_tags;
$ok_tags{$_}++ for @ok_tags;
 
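sub start {
    # called by the parser for every opening tag
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ( exists $ok_tags{$tag} ) {
        $filter .= $origtext;   # keep the tag exactly as written
        $want_it = 1;           # and keep the text that follows it
    } else {
        $want_it = 0;           # drop text until the next OK opening tag
    }
}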

sub text {
    my ($self, $text) = @_;
    $filter .= $text if $want_it; 
}

sub comment {
    # uncomment the two lines below to keep comments instead of stripping them
    # my ($self, $comment) = @_;
    # $filter .= "<!-- $comment -->";
}

sub end {
    my ($self, $tag, $origtext) = @_; 
    $filter .= $origtext if exists $ok_tags{$tag};
}

my $parser = Filter->new;
my $html = join '', <DATA>;
$parser->parse($html);
$parser->eof;

print $html;
print "\n\n------------------------\n\n";
print $filter;

__DATA__
<html>
<head>
  <title>Title</title>
</head>
<body>
<h1>Hello Parser</h1>
<p>You need HTML::Parser</p>
<h2>Parser rocks!</h2>
<a href="html.parser.com">html.parser.com</a>
<hr>
<pre>
  use HTML::Parser;
</pre>
<!-- HTML PARSER ROCKS! -->
</body>
</html>

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


Re: Re: Tag filtering: a standard mechanism?
by thpfft (Chaplain) on Sep 13, 2001 at 18:59 UTC

Thank you.

Unfortunately, it'll have to be a bit more complicated than that, because I want to screen attributes and their values as well as tags, but I've got most of it done.

But there are two things I don't quite understand, and the docs are every bit as opaque as you suggested. I hope you can help:

update: both issues fixed by subclassing properly. All working now. Thanks for your help.

1. Should $filter and $want_it really be class variables? It feels like they should be tied to the $parser, at the very least, or better, passed back directly from the methods. But I think I haven't quite grasped the event-driven model :(

2. Just a clarification: HTML::Parser modifies text in place, it seems. Is there some way to stop it from printing upon eof? There must be, but I can't find it...
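
For what it's worth, here is a minimal sketch of one way question 1 could go: the state lives in the parser object instead of in file-level variables, and one method hands back the filtered string. It assumes the same version-2-style method callbacks (&start, &text, &end) that the example above relies on when new() is called without arguments. The package name InstanceFilter, the methods reset_filter and filter_html, and the hash keys filter_out and filter_want are all made up for illustration. As far as I remember the docs, HTML::Parser objects are plain hashes and only keys starting with "_hparser" are reserved, but treat that as an assumption and check it against your version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package InstanceFilter;    # hypothetical name, not from the thread
use base 'HTML::Parser';

my %ok_tags = map { $_ => 1 } qw( h1 h2 h3 h4 p br );

# Clear the per-object state before each document.
sub reset_filter {
    my ($self) = @_;
    $self->{filter_out}  = '';   # assumed-safe key names (illustrative)
    $self->{filter_want} = 0;
    return $self;
}

# Opening tag: keep it if it is on the OK list and flag the following text.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    $self->{filter_want} = exists $ok_tags{$tag} ? 1 : 0;
    $self->{filter_out} .= $origtext if $self->{filter_want};
}

# Text: keep it only while the flag is set.
sub text {
    my ($self, $text) = @_;
    $self->{filter_out} .= $text if $self->{filter_want};
}

# Closing tag: keep it if it is on the OK list.
sub end {
    my ($self, $tag, $origtext) = @_;
    $self->{filter_out} .= $origtext if exists $ok_tags{$tag};
}

# Parse one complete document and return the filtered result.
sub filter_html {
    my ($self, $html) = @_;
    $self->reset_filter;
    $self->parse($html);
    $self->eof;
    return $self->{filter_out};
}

package main;

my $parser = InstanceFilter->new;
print $parser->filter_html('<h1>Hi</h1><a href="x">link</a>'), "\n";
```

Because nothing is printed inside the callbacks, the caller decides what to do with the returned string, which should sit more comfortably with something like a template method call. Two parser objects also no longer share one buffer.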

I'm trying to fold this into an existing system that uses TT2 and a subclass of Class::DBI, so I really need it to cooperate with TT's lovely lazy method calls, you see.

When it's done, should I post it here, or is this a rite of passage that everyone else has already passed through?



	PerlMonks