The problem arose when one of our customers wanted to feed formatted text from a Word 97 .doc file into an RDBMS to generate web pages with dynamic content. So I exported the text as HTML from Word, opened the file in a text editor and was confronted with a horror of HTML (you may know the feeling).
My first approach was to use HTML::Parser and a modified version of one of its example scripts to drop some tags (like <font>). HTML::Parser did a good job of that but left ugly things like <b><i> ... </b></i>, which is misnested and therefore not valid.
So I took a look at HTML::TreeBuilder and wrote the following sub to do the work. It works fine, but I'd like to ask my fellow monks for deeper insight.
Are there other ways to handle Word's HTML output and get valid HTML from it? Please give me some pointers (other than HTML Tidy, which can't be used here). Thanks.
# ... snippet ...
use HTML::TreeBuilder;

# tags to ignore (the tag is dropped, its content is kept)
my @ignore_tags = qw(font big small body dir html);
# tags to drop together with their content
my @ignore_elements = qw(script style head);

##########################################################
sub clean_up_htmltree {
##########################################################
    my $input = shift;
    my $warn  = 0;
    my $htmlex;

    my $h = HTML::TreeBuilder->new;
    $h->ignore_unknown(0);
    $h->warn($warn);
    $h->parse($input);
    $h->eof;    # signal end of input so the tree is complete

    foreach my $tag (@ignore_tags) {
        # <html> is the root element and has no parent to move its
        # content into; it is stripped from the output string below
        $htmlex = 1, next if lc($tag) eq "html";
        while (my $ok = $h->look_down('_tag', $tag)) {
            $ok->replace_with_content;
        }
    }
    foreach my $tag (@ignore_elements) {
        while (my $ok = $h->look_down('_tag', $tag)) {
            $ok->delete;    # detach and destroy the element
        }
    }

    # arguments: entities to encode, indent, optional end tags
    my $output = $h->as_HTML(undef, " ", {});
    $h = $h->delete();    # nuke it!
    if ($htmlex) {
        $output =~ s:^\s*<html>::m;
        $output =~ s:</html>\s*$::m;
    }
    return $output;
}
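To show the two operations the sub relies on in isolation, here is a minimal sketch, assuming HTML::TreeBuilder is installed. The sample markup and the variable names are invented for illustration and are not taken from the customer's document; it unwraps <font> (keeping its content) and removes <style> together with its content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical Word-style fragment, made up for this example.
my $input = '<html><head><style>p { color: red }</style></head>'
          . '<body><p><font face="Arial"><b><i>Hello</b></i> world</font></p>'
          . '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($input);
$tree->eof;    # tell the parser the input is complete

# Unwrap <font>: the element goes away, its children stay in place.
while (my $el = $tree->look_down('_tag', 'font')) {
    $el->replace_with_content;
    $el->delete;    # discard the now empty, detached element
}

# Remove <style> together with everything inside it.
while (my $el = $tree->look_down('_tag', 'style')) {
    $el->delete;
}

# Same as_HTML arguments as in the sub above:
# default entity encoding, one-space indent, all end tags emitted.
my $output = $tree->as_HTML(undef, ' ', {});
$tree->delete;
print $output;
```

Because the output is serialized from a tree rather than from the original token stream, the tags in it come out properly nested.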
alex pleiner <alex@zeitform.de>
zeitform Internet Dienste