Word HTML issues

mlhmich has asked for the wisdom of the Perl Monks concerning the following question:

I am using HTMLArea3, and my users paste Word HTML into it. Sometimes it looks OK in HTMLArea, sometimes it does not. Either way, when it gets posted to a site, it looks VERY bad. In Dreamweaver there is a "Clean up Word HTML" option. Is there any way to do something like that in Perl, with regexes? I am not very good with regexes, but has someone done something like this?

Thanks for all the monks' help,
Mike

Update: How does that script work with all the # in it?

20050515 Edit by ysth: restore original question

20050516 Edit by Corion: Unconsidered. Was considered: Animator: retitle: issues with HTML generated by (Microsoft) Word (Keep/Edit/Delete: 7/10/1)

Re: Word HTML issues by Corion (Patriarch) on May 15, 2005 at 19:57 UTC
Although I haven't used it, there is the Demoronizer, which purports to clean up the HTML generated by Word. I'm not sure whether it will help you. You could also disallow pasting Word stuff, because I'm not sure how HTMLArea3 handles pasted Word documents, as it doesn't have access to the special Word formatting. You could consider having your users paste or upload RTF, and then convert the RTF to proper HTML.
Re^2: Word HTML issues by ww (Archbishop) on May 15, 2005 at 22:05 UTC
Unfortunately, Demoronizer worked better on the HTML generated by the version of M$Word that was current when Demoronizer (oh, I love that name) was written than it does on the output from more recent Word versions; the newer ones use all manner of new and sometimes unpleasant, non-standard HTML (or, more recently, XML, which also tends to be unpleasant to convert). Corion's advice to have your users provide RTF (or even plain text) for conversion should work better than the latest version of Demoronizer I've found... and I even took a whack at updating it to deal with additional versions of what Word claims is .html. However, I see other recommendations for cleanup below... and I, for one, am going to check them out. You may find them more valuable (and easier) than either Demoronizer or learning enough (standards-compliant) HTML to convert .txt or .rtf.
Re: Word HTML issues by davidrw (Prior) on May 15, 2005 at 20:03 UTC
Yuck. Below is a script that i used recently to strip out a lot of Word-generated junk... caveat emptor: it was a quick & dirty solution for my specific files. But some of the regexes may be of some use. Note that it gets rid of everything in between `<!...>` tags, and also pretty much strips all style junk with `mso` in it. Read more... (1501 Bytes) As for a more generic approach, I haven't used one, but a quick CPAN search for HTML yields HTML::Scrubber and HTML::Sanitizer, which (at a 2-second glance) look promising.
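The script itself is not reproduced here, so the following is only a rough sketch of the two cleanups described above (dropping `<!...>` blocks and stripping `mso` style junk). It is not davidrw's code: the sub name `clean_word_html`, the exact patterns, and the sample input are assumptions made for illustration, and real Word output (for example, `mso-` values containing `&quot;`) will need more care than this.

```perl
use strict;
use warnings;

# Hypothetical helper, not the script from the post above.
sub clean_word_html {
    my ($html) = @_;

    # Remove whole comments, including Word's conditional blocks
    # such as <!--[if gte mso 9]> ... <![endif]-->.
    $html =~ s#<!--.*?-->##sg;

    # Remove any remaining <!...> declarations, e.g. <![if !supportLists]>.
    $html =~ s#<![^>]*>##g;

    # Remove mso-* declarations (mainly found inside style attributes).
    $html =~ s#\bmso-[^:;"']+:[^;"']*;?\s*##g;

    # Remove Word's class names, e.g. class=MsoNormal or class="MsoNormal".
    $html =~ s#\s+class="?Mso\w+"?##g;

    # Remove style attributes that are now empty.
    $html =~ s#\s+style=(["'])\s*\1##g;

    return $html;
}

# Made-up sample input.
my $dirty = '<p class=MsoNormal style="mso-bidi-font-size:10.0pt;color:red">'
          . 'Hi<!--[if gte mso 9]><xml>junk</xml><![endif]--> there</p>';

print clean_word_html($dirty), "\n";
```

Regexes like these only suit input whose shape you already know; for arbitrary pasted HTML, a parser-based module such as the ones mentioned above is likely the safer route.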
Re^2: Word HTML issues by Animator (Hermit) on May 15, 2005 at 20:57 UTC
The OP changed his question into: "How does that script work with all the # in it?" I assume this was meant to be a reply to this post, so I'll post my answer here. The # in `s#class=section#class="Section"#sg;`, for example, is used as a regex delimiter, not as a comment. It is the same as `s/class=section/class="Section"/sg;` except that with s/// you would need to escape any / in the pattern (which could make it less readable, but that does not apply in this case, since there is no / in the regex).
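A small sketch to check this for yourself, assuming a made-up sample string: both forms produce the same result, and a # that appears after the statement is still an ordinary comment.

```perl
use strict;
use warnings;

my $input = '<div class=section>text</div>';   # made-up sample

# Same substitution, two different delimiters.
(my $with_hash  = $input) =~ s#class=section#class="Section"#sg;  # this '#' starts a real comment
(my $with_slash = $input) =~ s/class=section/class="Section"/sg;

print "identical\n" if $with_hash eq $with_slash;
print "$with_hash\n";
```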
Re: Word HTML issues by astroboy (Chaplain) on May 15, 2005 at 20:11 UTC
The problem with HTMLArea is that the creators have stopped maintaining and supporting it. Have a look at FCKeditor, one of the most popular SourceForge projects. It has "Paste from Word cleanup with auto detection".
Re: Word HTML issues by davidrw (Prior) on May 15, 2005 at 21:07 UTC
"How does that script work with all the # in it?" First, please don't replace your original content like that (i think the editors are going to fix it); just reply to the replies instead. Anyway, i think you're asking about my usage of things like `s#foo#stuff#`. Because of how the operators work (see the perlop and perlre man pages), all of these do the same thing: `s/foo/stuff/`, `s#foo#stuff#`, `s!foo!stuff!`, `s?foo?stuff?`. In this case, i used s### instead of s/// for two reasons: (1) the # is pretty legible, since it's visually a block; (2) since i'm dealing w/ html tags, i don't have to worry about escaping /'s. For example, these two are identical, but one is obviously easier to read & write: `s/<tr><td>.?<\/td><\/tr>/FOO/;` and `s#<tr><td>.?</td></tr>#FOO#;`
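To see the escaping point in action, here is a minimal sketch on a made-up table row. The third form, with paired braces, is not from the post above; it is another delimiter style Perl accepts and is shown only for comparison.

```perl
use strict;
use warnings;

my $row = '<tr><td>x</td></tr>';   # made-up sample row

# With / as the delimiter, every / in the pattern must be escaped.
(my $slashes = $row) =~ s/<tr><td>.?<\/td><\/tr>/FOO/;

# With # as the delimiter, the closing tags can be written as-is.
(my $hashes = $row) =~ s#<tr><td>.?</td></tr>#FOO#;

# Paired delimiters work too: s{pattern}{replacement}.
(my $braces = $row) =~ s{<tr><td>.?</td></tr>}{FOO};

print "$slashes $hashes $braces\n";
```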

