http://qs321.pair.com?node_id=123551

projekt21 has asked for the wisdom of the Perl Monks concerning the following question:

The problem arose when one of our customers wanted to feed formatted text from a Word 97 .doc file into an RDBMS to generate web pages with dynamic content. So I exported the text as HTML from Word, opened the file in a text editor, and was confronted with a horror of HTML (you may know the kind).

My first approach was to use HTML::Parser and a modified version of one of its example scripts to drop some tags (like <font>). HTML::Parser did a good job on that but left ugly things like <b><i> ... </b></i>, which is improperly nested and therefore not valid.
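To make the nesting problem concrete, here is a minimal sketch (not part of my actual script) that uses only core Perl to report end tags that do not match the most recently opened tag. The sub name find_misnested is hypothetical, and the sketch assumes simple markup: it ignores void elements such as <br> and attribute values containing ">".

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of end tags that do not close
# the most recently opened tag. Simplified on purpose: void elements
# (<br>, <img>) and '>' inside attribute values are not handled.
sub find_misnested {
    my ($html) = @_;
    my (@open, @problems);

    while ($html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g) {
        my ($closing, $tag) = ($1, lc $2);

        if (!$closing) {
            push @open, $tag;           # start tag: remember it
        }
        elsif (@open && $open[-1] eq $tag) {
            pop @open;                  # end tag matches the innermost open tag
        }
        else {
            push @problems, $tag;       # end tag closes the wrong element
        }
    }
    return @problems;
}

print join(",", find_misnested("<b><i>text</b></i>")), "\n";   # misnested input
print join(",", find_misnested("<b><i>text</i></b>")), "\n";   # properly nested input
```

A check like this only detects the problem; it does not repair it, which is why I moved on to a tree-based approach.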

So I took a look at HTML::TreeBuilder and wrote the following sub to do the work. It works fine, but I want to ask my fellow monks for deeper knowledge.

Are there other ways to handle Word's HTML output and get valid HTML from it? Please give me some directions (other than HTML Tidy, which can't be used here). Thanks.

# ... snippet ...

# tags to ignore (the tag is removed, its content is kept)
my @ignore_tags = qw(font big small body dir html);

# tags to drop together with their content
my @ignore_elements = qw(script style head);

##########################################################
sub clean_up_htmltree {
##########################################################
    my $input = shift;
    my $warn  = 0;
    my $htmlex;

    use HTML::TreeBuilder;
    my $h = HTML::TreeBuilder->new;
    $h->ignore_unknown(0);
    $h->warn($warn);
    $h->parse($input);
    $h->eof;    # signal end of input so the tree is finalized

    foreach (@ignore_tags) {
        $htmlex = 1, next if lc($_) eq "html";    # remove <html>...</html>?
        while (my $ok = $h->look_down('_tag', "$_")) {
            $ok->replace_with_content;
        }
    }

    foreach (@ignore_elements) {
        while (my $ok = $h->look_down('_tag', "$_")) {
            $ok->detach;
        }
    }

    # arguments: entities to encode, indent, optional end tags
    my $output = $h->as_HTML(undef, " ", {});
    $h = $h->delete();    # nuke it!

    if ($htmlex) {
        $output =~ s:^\s*<html>::m;
        $output =~ s:</html>\s*$::m;
    }

    return $output;
}

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste