
The problem arose when one of our customers wanted to feed formatted text from a Word 97 .doc file into an RDBMS to generate web pages with dynamic content. So I exported the text as HTML from Word, opened the file in a text editor, and was confronted with a horror of HTML (you may know the kind).

My first approach was to use HTML::Parser and a modified version of one of its example scripts to drop some tags (like <font>). HTML::Parser did a good job on that, but it left ugly things like <b><i> ... </b></i> in place, which isn't valid because the tags close in the wrong order.
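A minimal sketch of that tag-dropping approach, assuming the HTML::Parser version 3 handler API. This is not the modified example script itself; the names drop_tags and %drop are hypothetical. Everything except the listed tags is copied through untouched, which is why badly nested markup survives.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: copy the input through unchanged, minus the listed tags.
sub drop_tags {
    my ($html, @tags) = @_;
    my %drop = map { lc($_) => 1 } @tags;
    my $out  = '';

    # Start and end tags share one handler: emit the original source text
    # unless the (lowercased) tag name is on the drop list.
    my $tag_h = sub {
        my ($tagname, $text) = @_;
        $out .= $text unless $drop{$tagname};
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_h, 'tagname, text' ],
        end_h       => [ $tag_h, 'tagname, text' ],
        # Text, comments, declarations and so on pass through as they are.
        default_h   => [ sub { my ($text) = @_; $out .= $text if defined $text }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush any buffered text
    return $out;
}

# The <font> wrapper disappears, but the misnested <b>/<i> pair is untouched.
print drop_tags('<font size="2"><b><i>text</b></i></font>', 'font'), "\n";
```

Because the parser only reports tokens and never builds a tree, it has no way to notice that </b> closes before </i>.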

So I took a look at HTML::TreeBuilder and wrote the following sub to do the work. It works fine, but I would like to ask my fellow monks for their deeper knowledge.

Are there other ways to handle Word's HTML output and get valid HTML from it? Please give me some directions (other than HTML Tidy, which can't be used). Thanks.

# ... snippet ...
# tags to remove while keeping their content
my @ignore_tags = qw(font big small body dir html);

# tags to drop together with their content
my @ignore_elements = qw(script style head);


##########################################################
sub clean_up_htmltree {
##########################################################

  my $input = shift;
  my $warn = 0; 
  my $htmlex; 
  use HTML::TreeBuilder;

  my $h = HTML::TreeBuilder->new;
  $h->ignore_unknown(0);
  $h->warn($warn);
  $h->parse($input);
  $h->eof; # signal end of input so the tree gets finalized

  foreach (@ignore_tags) {
    $htmlex = 1, next if lc($_) eq "html"; # remove <html>...</html>?
    while (my $ok = $h->look_down('_tag', "$_")) { 
      $ok->replace_with_content; 
    }
  }
  foreach (@ignore_elements) {
    while (my $ok = $h->look_down('_tag', "$_")) { 
      $ok->detach; 
    }
  }

  my $output = $h->as_HTML(undef, " ", {}); # entities to encode, indent, optional endtags
  $h = $h->delete(); # nuke it!
  if ($htmlex) {
    $output =~ s:^\s*<html>::m;
    $output =~ s:</html>\s*$::m;
  }
  return $output;
}
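The two removal styles used in the sub can also be seen in isolation. The following is only a sketch under assumptions: the helper name strip_markup and its array-ref arguments are hypothetical, and it uses delete rather than detach so that dropped elements are destroyed right away. It must not be asked to unwrap the root <html> element, which has no parent to receive its content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: unwrap some tags (keep content), drop others (with content).
sub strip_markup {
    my ($html, $unwrap, $drop) = @_;    # two array refs of tag names

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;                         # finish parsing before touching the tree

    for my $tag (@$unwrap) {
        # replace_with_content splices the children into the parent,
        # so each pass removes one matching element.
        while (my $el = $tree->look_down(_tag => $tag)) {
            $el->replace_with_content;
        }
    }
    for my $tag (@$drop) {
        # delete detaches the element and destroys it with all descendants.
        while (my $el = $tree->look_down(_tag => $tag)) {
            $el->delete;
        }
    }

    my $out = $tree->as_HTML(undef, ' ', {});
    $tree->delete;
    return $out;
}

my $word_html = '<html><head><style>p{}</style></head>'
              . '<body><p><font face="Arial"><b>Hello</b></font> world</p></body></html>';
print strip_markup($word_html, ['font'], ['style']);
```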

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste

In reply to Converting Word97 (or later) exported HTML to valid HTML by projekt21
