Converting Word97 (or later) exported HTML to valid HTML

projekt21 has asked for the wisdom of the Perl Monks concerning the following question:

The problem arose when one of our customers wanted to feed formatted text from a Word 97 .doc file into an RDBMS to generate web pages with dynamic content. So I exported the text as HTML from Word, opened the file in a text editor, and was confronted with a horror of HTML (you may know it).

My first approach was to use HTML::Parser and a modified version of one of its example scripts to drop some tags (like ). HTML::Parser did a good job on that but left ugly things like ... , which isn't valid.

So I took a look at HTML::TreeBuilder and wrote the following sub to do the work. It works fine, but I want to ask my fellow monks for deeper knowledge.

Are there other ways to handle Word's HTML output and get valid HTML from it? Please give me some directions (other than htmltidy, which can't be used). Thanks.

# ... snippet ...
# tags to ignore
my @ignore_tags = qw(font big small body dir html);

# tags to drop together with their content
my @ignore_elements = qw(script style head);


##########################################################
sub clean_up_htmltree {
##########################################################

  my $input = shift;
  my $warn = 0; 
  my $htmlex; 
  use HTML::TreeBuilder;

  my $h = HTML::TreeBuilder->new;
  $h->ignore_unknown(0);
  $h->warn($warn);
  $h->parse($input);
  $h->eof; # signal end of input so the tree is finalized

  foreach (@ignore_tags) {
    $htmlex = 1, next if lc($_) eq "html"; # remove <html>...</html>?
    while (my $ok = $h->look_down('_tag', "$_")) { 
      $ok->replace_with_content; 
    }
  }
  foreach (@ignore_elements) {
    while (my $ok = $h->look_down('_tag', "$_")) { 
      $ok->detach; 
    }
  }

  my $output = $h->as_HTML(undef, " ", {}); # entities to encode, indent, optional end tags
  $h = $h->delete(); # nuke it!
  if ($htmlex) {
    $output =~ s:^\s*<html>::m;
    $output =~ s:</html>\s*$::m;
  }
  return $output;
}

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste


Replies are listed 'Best First'.
Re: Converting Word97 (or later) exported HTML to valid HTML by Corion (Patriarch) on Nov 06, 2001 at 15:50 UTC
Honestly, as I read the title of your node, HTML tidy immediately sprang to mind, as it even has command-line switches specifically for cleaning up Office HTML. On that website there is also code showing how to call HTML tidy from Perl, including some proposed error checking which seems mostly geared toward Unix. On second thought, it is not really clear why they use the code they use, so I'll post it here, together with my replacement:

## This is what I think is needed beforehand :
open( TIDY, "html-tidy $commandline|") or die "Couldn't spawn html-tidy : $!\n";
my @output;
@output = <TIDY>;

## Here begins their code :
if (close(TIDY) == 0) {
  my $exitcode = $? >> 8;
  if ($exitcode == 1) {
    printf STDERR "tidy issued warning messages\n";
  } elsif ($exitcode == 2) {
    printf STDERR "tidy issued error messages\n";
  } else {
    die "tidy exited with code: $exitcode\n";
  }
} else {
  printf STDERR "tidy detected no errors\n";
}

I think this could simply be done with the following code, but I haven't checked all possible outcomes...

my @output = qx(html-tidy $commandline);
my $exitcode = $? >> 8;
if ($exitcode == 0) {
  # a clean run must not fall through to the die below
  printf STDERR "tidy detected no errors\n";
} elsif ($exitcode == 1) {
  printf STDERR "tidy issued warning messages\n";
} elsif ($exitcode == 2) {
  printf STDERR "tidy issued error messages\n";
} else {
  die "tidy exited with code: $exitcode\n";
}

Wrapping it up, unless you tell us a really convincing reason why html-tidy is not possible (and by "not possible" I also mean putting html-tidy into a Perl script, writing it out to /tmp, starting it there and afterwards deleting the file again), I'll stick with this solution :-)

perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
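The exit-code handling discussed above can be pulled into a small function so it is testable without running tidy at all. This is only a sketch: describe_tidy_status is a hypothetical helper name that does not appear in the thread, and it assumes tidy's usual convention of 0 for a clean run, 1 for warnings and 2 for errors. It takes a wait status such as $? and also covers the cases where the program could not be started or was killed by a signal.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): describe a wait status such
# as $?, assuming tidy's convention 0 = clean, 1 = warnings, 2 = errors.
sub describe_tidy_status {
    my ($status) = @_;

    # -1 means the program could not be started at all.
    return 'tidy could not be started' if $status == -1;

    # The low seven bits hold the signal number if the process was killed.
    return 'tidy was killed by signal ' . ($status & 127) if $status & 127;

    # The real exit code lives in the high byte.
    my $exitcode = $status >> 8;
    return 'tidy detected no errors'      if $exitcode == 0;
    return 'tidy issued warning messages' if $exitcode == 1;
    return 'tidy issued error messages'   if $exitcode == 2;
    return "tidy exited with unexpected code $exitcode";
}

# Simulated wait statuses, so the sketch runs without tidy installed.
for my $status (0, 1 << 8, 2 << 8, 3 << 8) {
    print describe_tidy_status($status), "\n";
}
```

In a real script the argument would be $? read immediately after the qx() or close() call, before anything else can overwrite it.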
Re: Converting Word97 (or later) exported HTML to valid HTML by jmcnamara (Monsignor) on Nov 06, 2001 at 15:52 UTC
The demoroniser might help.

Update: Here is the version that was updated by Larry Rosler and TomC.

-- John.
Re: Converting Word97 (or later) exported HTML to valid HTML by projekt21 (Friar) on Nov 06, 2001 at 16:50 UTC
Thanks for the replies. I've checked all of those, but:

demoroniser removes the biggest horrors but leaves some behind (e.g. <b><i> ... </b></i>). Maybe I can change the code.

tidy is the tool of choice (under normal conditions). As I mentioned in the CB, the script/website runs on a provider's server where I am not allowed to install software (poor customer's choice). In any case, I need to drop all CSS stuff, which requires post-parsing tidy's output.

wvHtml looks interesting, too. I may implement a doc file upload. However, both restrictions mentioned above (no software installation, no CSS stuff) apply here, too.

Thanks for your comments and wisdom. I'll sleep on this (for a night or two) before I go on.

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste
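The "drop all CSS stuff" step mentioned above could be done with the same HTML::TreeBuilder approach as in the original node, whether the input comes straight from Word or from another tool. The following is a minimal sketch, not a tested solution for real Word output: strip_css is a hypothetical helper name, and the choice of the style and class attributes as "CSS stuff" is an assumption that may need extending.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): remove <style> elements and
# the style/class attributes, keeping the rest of the markup.
sub strip_css {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # finish parsing before touching the tree

    # Drop <style> elements together with their content.
    $_->delete for $tree->look_down(_tag => 'style');

    # Visit every element; passing undef as the value deletes an attribute.
    for my $el ($tree->look_down(sub { 1 })) {
        $el->attr($_, undef) for qw(style class);
    }

    my $out = $tree->as_HTML(undef, ' ', {});
    $tree->delete;    # free the tree
    return $out;
}

print strip_css('<p class="MsoNormal" style="margin:0cm">Hello <b>world</b></p>');
```

Because this needs only pure-Perl modules already used in the original node, it should fit the "no software installation" restriction, assuming HTML::TreeBuilder is available on the provider's server.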
Re: Re: Converting Word97 (or later) exported HTML to valid HTML by hatter (Pilgrim) on Nov 06, 2001 at 18:07 UTC
If you can run CGIs, chances are you can upload precompiled binaries, or compile your own binaries on their server from CGIs, and then call them from other scripts. Unless they need to approve scripts before putting them live, in which case obfuscate everything and see if they put it live when they don't understand it. /msg me if you want some more specific hints on doing things on shared servers that the admin thought they could stop.

the hatter
Re: Converting Word97 (or later) exported HTML to valid HTML by jeroenes (Priest) on Nov 06, 2001 at 16:11 UTC
It is a noble goal to produce nice HTML from the stuff that Word spits out. Noble, but difficult. There is a tool for that. I'm browsing now to find that tool... here it is: 'mswordview'. Let me download and try... oh, new project page here. Looks nice; there should be HTML 4.0, LaTeX, plain text, PS, and PDF output... compiling/testing (oh, you only need wv; skip the libwv).

At a glance the output is decent HTML. The authors claim W3C HTML 4.0 compliance. Methinks that 'wordview' is the way to go.

Jeroen
Re: Converting Word97 (or later) exported HTML to valid HTML by andye (Curate) on Nov 06, 2001 at 18:00 UTC
You're so right - it's really quite horrendous. I've used two solutions for this in the past (neither Perl though, sorry):

Microsoft themselves have released a utility to do this - presumably available from their website
Macromedia Dreamweaver has a specific function to do this

The second of these obviously can't be incorporated in a script, and the first probably can't either, but perhaps you could persuade your users to run their HTML files through the Microsoft utility on their Windows desktop?

hth a little,
andy.
Re: Re: Converting Word97 (or later) exported HTML to valid HTML by impossiblerobot (Deacon) on Nov 06, 2001 at 20:44 UTC
I've found a Word filter from Microsoft that is supposed to output cleaner HTML. (I assume this is what you were talking about.) I also tend to use Dreamweaver for this task, but it does leave some of the CSS stuff behind, so some cleanup is still required.

Update: Although I still haven't tested the output, it appears that the MS Word filter can be used from the command line, as a standalone GUI application, or from within Word, and can batch process multiple files.

Impossible Robot
Re: Converting Word97 (or later) exported HTML to valid HTML by Hero Zzyzzx (Curate) on Nov 06, 2001 at 21:36 UTC
I do this with a file upload and wp2html, which creates really lean HTML and has the added bonus of working with WordPerfect docs too. I'm really happy with this solution - it's fast as heck, the HTML is pretty good and you have mucho control over the generated HTML.

While you can get the source, there is a 5 pound licensing fee (very reasonable, considering the amount of work that must have gone into this). The author is very responsive, too.

I've tried wvHTML too. I like wp2html better because it keeps the intent of the document, and a good amount of the formatting, without trying to stay TOO true to the original format of the document. Basically, wp2html gets the good stuff, while wvHTML jumps through too many hoops to keep the converted document looking like the original Word doc.

If you can upload a compiled binary, I highly suggest you check it out. It rocks!

-Any sufficiently advanced technology is indistinguishable from doubletalk.

