http://qs321.pair.com?node_id=217959

starlight has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody,
My question concerns removing the dreadful HTML tags created by Microsoft Word's "Save as HTML..." feature.

(I know, I know... Never mind why I have to deal with it in the first place.)

Are there any freely available scripts or modules to clean up this mess? Perhaps something using HTML-Parser and/or HTML-Tagset?

Replies are listed 'Best First'.
Re: How to clean-up Microsoft Word HTML
by impossiblerobot (Deacon) on Dec 06, 2002 at 02:56 UTC
Re: How to clean-up Microsoft Word HTML
by pfaut (Priest) on Dec 06, 2002 at 01:53 UTC

    Would this help? I've never used it successfully myself (the documents I tried to fix might not have suffered from the problems this tool addresses) and I don't know exactly what problems you're trying to solve.

      Here's my solution for producing pretty (hand-editable) HTML. Set $dos for CR/LF line endings and $nostyle to remove all style information:

      #!/usr/bin/perl
      $nostyle = 1;   # remove all style information
      $dos     = 1;   # write CR/LF line endings

      # Slurp all input into one string
      while (<>) { $text .= $_; }

      # Strip Word-specific markup
      $text =~ s/content="Microsoft Word \d+"/content="wordclean.pl"/g;
      $text =~ s/(\r|\n)+/ /g;
      $text =~ s/<\/?o:.+?>//g;
      $text =~ s/<!--.+?-->//g;   # non-greedy, so text between two comments survives
      $text =~ s/xmlns(:.+?)?=".+?"//g;
      $text =~ s/mso-.+?:\s?.+?'/'/g;
      $text =~ s/mso-.+?:\s?.+?;//g;
      $text =~ s/style=''//g;
      $text =~ s#style='.+?'##g if ($nostyle);
      $text =~ s/<link rel=File-List href=".+?">//g;
      $text =~ s/class=\w+//g;
      $text =~ s/<\/?st1:\w+>//g;

      # Collapse whitespace and drop leftover spans
      $text =~ s/\s+>/>/g;
      $text =~ s/>\s+</></g;
      $text =~ s/\s+/ /g;
      $text =~ s#</?span>##g if ($nostyle);
      $text =~ s#<span style='font-size:12.0pt;\s?'>(.+?)</span>#$1#g;
      $text =~ s#<span[^>]*>\s*</span>##g;
      $text =~ s#<span>(.+)</span>#$1#g;

      # Re-insert line breaks so the result is hand-editable
      $text =~ s/(<\w.+?>)/\n$1/g;
      $text =~ s/\n<b>/<b>/g;
      $text =~ s#</(html|body|head|tr|td|table|div)>#\n</$1>#g;
      $text =~ s#\n<html>#<html>#;
      $text =~ s#\n#\r\n#g if ($dos);

      print $text;
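
To see what a few of these substitutions do in isolation, here is a minimal sketch that applies a handful of them to a made-up Word fragment. The helper name `strip_word_markup` and the sample string are hypothetical and only there for illustration; it is a subset of the script above, not a replacement for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies a few of the substitutions from the script above.
sub strip_word_markup {
    my ($text) = @_;
    $text =~ s/<\/?o:.+?>//g;         # Office namespace tags such as <o:p> and </o:p>
    $text =~ s/mso-.+?:\s?.+?;//g;    # mso-* style properties terminated by ';'
    $text =~ s/style=''//g;           # style attributes left empty by the previous step
    $text =~ s/class=\w+//g;          # unquoted class attributes such as class=MsoNormal
    $text =~ s/\s+>/>/g;              # whitespace left dangling before '>'
    return $text;
}

# Made-up sample input in the style of Word's "Save as HTML" output
my $sample = "<p class=MsoNormal style='mso-pagination:none;'>Hello<o:p></o:p></p>";
print strip_word_markup($sample), "\n";
```

Regexes like these are tuned to the markup one particular Word version emits, so it is worth trying them on a small fragment like this before running them over a whole document.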
Re: How to clean-up Microsoft Word HTML
by reclaw (Curate) on Dec 06, 2002 at 02:52 UTC

      ++Reclaw

      Cleaning up Word HTML is actually the exact purpose for which Tidy was created. It started as a W3C project, or at least was hosted there for a time. I understand it's an excellent piece of software, though I have only tinkered with it because I write my HTML in Notepad. *grin*
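
For what it's worth, driving Tidy from Perl might look roughly like the sketch below. It assumes a `tidy` binary is on your PATH. The helper name `tidy_command` and the file names are made up for illustration, and the `word-2000` and `bare` options should be checked against the documentation for your Tidy version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the argument list for one tidy run.
sub tidy_command {
    my ($in, $out) = @_;
    return (
        'tidy',
        '-quiet',                 # suppress non-essential output
        '--word-2000', 'yes',     # strip the surplus markup Word 2000 inserts
        '--bare',      'yes',     # strip further Microsoft-specific HTML
        '-output',     $out,      # write the cleaned document here
        $in,
    );
}

if (@ARGV == 2) {
    # Usage: script.pl word.html clean.html
    my @cmd = tidy_command(@ARGV);
    system(@cmd);
    die "could not run tidy: $!\n" if $? == -1;
    # tidy exits with 1 when it only had warnings and 2 when it hit errors
    my $status = $? >> 8;
    die "tidy reported errors\n" if $status >= 2;
}
else {
    # Without arguments, just show the command that would be run
    print join(' ', tidy_command('word.html', 'clean.html')), "\n";
}
```

Passing the command as a list to `system` avoids the shell, so file names with spaces need no extra quoting.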