Simplify HTML programatically

guha has asked for the wisdom of the Perl Monks concerning the following question:

I have a web application where the users can add information to a textarea. To make it a bit more spiffy I added a JavaScript application which is a sort of crippled HTML editor. This is all fine and dandy, but then I noticed that the users were cutting and pasting directly from Word. This has two drawbacks.

* The HTML from Word is extremely wordy; a rather simple table can generate megabytes of HTML.

* The HTML is too complex for the JavaScript application.

A possible solution to this problem would be if I could somehow simplify the pasted HTML when saving it to the DB.

Something like: take this very complex HTML, allow only tables, bold, italic, h1, h2, plain text, ... and remove everything else. This would change the WYSIWYG appearance but not the content.

I have tried using regexps to remove certain specific Word-HTML constructs. This has the advantage of decreasing the size of the resulting HTML, but it seems to destabilize the markup as well.

Ideas, pointers to CPAN, anyone?

Replies are listed 'Best First'.
Re: Simplify HTML programatically by ww (Archbishop) on Jun 07, 2006 at 21:16 UTC
Demoronizer, tidy, and various other apps have been discussed here. Scott7477 is just beginning work on a tutorial on this general topic (and I'm allegedly helping, though in fact planetscape has, IMO, done most of that by pulling together a formidable collection of refs, cites, etc., while I may be of little value other than word-butchery). GrandFather has posted a WYSIWYG editor in CUFP which may be relevant and helpful.

I could undoubtedly name many other relevant resources...

BUT...

Were I in your shoes, I would not allow "cut&pasting" of M$word (so-called) .html under any circumstances, for the very good reasons you outline. Perhaps even more important, I would not allow use of any but a very small subset of html tags, for a whole range of taste, security and simplicity reasons, and just plain "old-fogey-ism."

On the other hand, the Monastery does provide minimally restrictive methods for a visitor to write some .html. Have you explored those methods?
Re^2: Simplify HTML programatically by planetscape (Chancellor) on Jun 09, 2006 at 02:54 UTC
FWIW, my collection of links etc. may be found at the top of planetscape's extra scratchpad. HTH,

Update 2006-06-10: Or, better, inline here, in case I later forget and delete/move something...

HTML TIDY

* Clean up your Web pages with HTML TIDY
* HTML Tidy Library Project
* Charlie's Tidy Add-ons, including a Perl Wrapper
* HTML tidy, using XML::LibXML

Demoronizer

* Demoronizer - A perl script to sanitize Microsoft's HTML
* Demoronizer - Correct Moronic Microsoft HTML
* Word HTML 2 Formatting Objects: WH2FO is a Java application that processes an HTML output, created with Word 2000, and transforms it into an XML content file and an XSL stylesheet file. From these files, a standard XSLT processor may be used to obtain a file containing only XSL-FO markup. You can also apply a stylesheet that converts the XML back into HTML, discarding all the extra markup added by Word. Using an XSL-FO renderer, such as FOP, you can also render your document into PDF.

PMEdit

* GrandFather's PerlMonks Editor - on CPAN: PMEdit-001.000104-1.pl

Miscellaneous

* HTML::Parser has several example programs, such as hstrip.pl, which might be helpful
* OpenOffice.org - free office suite and MSOffice alternative. Use OpenOffice to save in HTML format. OpenOffice creates much cleaner HTML, and the resulting file may still be run through HTML Tidy or a script such as hstrip.pl.
* Word Processor Filters
* gxml2html

planetscape
Re: Simplify HTML programatically by davidrw (Prior) on Jun 07, 2006 at 21:11 UTC
A simple PM search for "word html" yielded these:

* How to clean-up Microsoft Word HTML
* Converting and cleaning Word's HTML export to valid HTML
* Converting Word97 (or later) exported HTML to valid HTML

They're all a few years old, but at a quick glance it looks like they could still be useful, or at least point in some directions... You should try a Super Search as well.
Re: Simplify HTML programatically by Joost (Canon) on Jun 07, 2006 at 21:37 UTC
htmltidy has specific options to clean up Word-generated HTML. It's still not perfect, but it gets rid of the worst. Take special notice of the "word-2000" and "bare" options.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re: Simplify HTML programatically by trwww (Priest) on Jun 08, 2006 at 06:29 UTC
I use SAX for this. For me it provides the best performance/maintainability ratio. Here is a standalone program that takes an HTML file name as an argument and prints the output to STDOUT. Its purpose is to strip all tags except p, div, and a tags.

The program driver sets up the SAX pipeline. It uses XML::Driver::HTML as the SAX driver and XML::SAX::Writer as the writer. When `parse` is called on the driver, the Pipeline sends the driver's stream to some helper modules, to our filter module, and finally to the writer.

In the custom filter, start_element is called for each tag. The code checks whether the tag is an a, div, or p tag, and if it is, it forwards the tag to the writer. Otherwise, the tag is ignored and removed from the stream. The same work needs to happen in the end_element callback.

benefits:

* Accepts non-well-formed HTML and outputs well-formed XML.
* Scales well. In theory it should use a constant amount of memory.
* High level of control of the output document.

drawbacks:

* You need to learn SAX.
* If you remove a node from the stream in the start_element callback, you have to remove the closing tag in the end_element callback (this isn't so much a "drawback" as a reminder that you have to stay on your toes).

```perl
use warnings;
use strict;

use XML::SAX::Machines qw(Pipeline);
use XML::Driver::HTML;
use XML::Filter::SAX1toSAX2;
use XML::Filter::BufferText;
use XML::SAX::Writer;

my $output; # transformation target
my $writer = XML::SAX::Writer->new( Output => \$output );

# driver -> SAX1toSAX2 -> BufferText -> our filter -> writer
my $machine = Pipeline(
    'XML::Filter::SAX1toSAX2' =>
    'XML::Filter::BufferText' =>
    'XML::Filter::HtmlTagStripper' =>
    $writer
);

my $html = XML::Driver::HTML->new(
    Handler => $machine,
    Source  => { SystemId => $ARGV[0] }
);

$html->parse();
print $output;

package XML::Filter::HtmlTagStripper;
use base qw|XML::SAX::Base|;

# <marker language="foo" />
# $el->{Name} == 'marker'
# $el->{Attributes}{'{}language'} == language attribute
# $el->{Attributes}{'{}language'}{Value} == 'foo'

# Forward only p, div and a start tags; everything else is dropped.
sub start_element {
    my($self, $el) = @_;
    if ( $el->{Name} =~ m/^(?:p|div|a)$/i ) {
        $self->SUPER::start_element( $el );
    }
}

# Must mirror start_element so that open and close tags stay paired.
sub end_element {
    my($self, $el) = @_;
    if ( $el->{Name} =~ m/^(?:p|div|a)$/i ) {
        $self->SUPER::end_element( $el );
    }
}

1;
```

I know it's not the most un-rocket-sciencey thing in the world, but it's not too tricky, and once it clicks in your head and you realize that this is high performance XML parsing, the possibilities are boggling.

Let's take a look at a run:

```
$ cat striptags.html
<html>
<head>
<title>Test Document</title>
</head>
<body>
<p>The first paragraph</p>
<p>the second paragraph</p>
<hr width="75%">
<div>last modified: WHENEVER</div>
</body>
</html>
$ perl striptags.pl striptags.html
Test Document<p>The first paragraph</p><p>the second paragraph</p><div>last modified: WHENEVER</div>
```

Enjoy,

Todd W.
Re^2: Simplify HTML programatically by hobbs (Monk) on Jun 09, 2006 at 05:09 UTC
For something of a simpler* solution, but in the same vein, there's HTML::TreeBuilder. HTML::Element provides all of the primitives that you really need for an operation like this: `look_down` to identify relevant elements, `replace_with_content` to "remove" a tag without removing what it contains, and `delete` to completely destroy all signs of a given element. I'm not up to writing an example right now, but it's truly simple. Give it a shot! It goes a long way, and the output is bound to be less of a mess than the input.

* edit: okay, I realized that some might be confused by this usage of simple, since trwww's example is pretty simple in itself. Mostly it's a matter of being allowed to think in terms of tree manipulations instead of opens and closes and stacking and de-stacking. The corresponding cost is in storage, but it's usually not worrisome.
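A minimal sketch of the tree approach hobbs describes, for anyone who wants a starting point. The whitelist, the list of tags to drop outright, and the `simplify_html` helper name are assumptions made for illustration (the whitelist is loosely based on the original question); only `look_down`, `replace_with_content`, `delete`, `descendants`, `attr`, `all_external_attr_names` and `as_HTML` come from HTML::Element itself.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical whitelist, loosely based on the original question.
# html/head/body stay listed because TreeBuilder adds them implicitly.
my %keep = map { $_ => 1 } qw(html head body table tr td th b i h1 h2 p);

# Elements whose content is noise too, so they are removed entirely.
my @drop = qw(script style);

sub simplify_html {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Pass 1: destroy unwanted elements together with their content.
    for my $tag (@drop) {
        $_->delete for $tree->look_down( _tag => $tag );
    }

    # Pass 2: unwrap everything that is not whitelisted (children stay in
    # place) and strip all attributes from the elements that are kept.
    for my $el ( $tree->descendants ) {
        if ( $keep{ $el->tag } ) {
            $el->attr( $_, undef ) for $el->all_external_attr_names;
        }
        else {
            $el->replace_with_content->delete;
        }
    }

    # Serialize the body; the empty hashref forces explicit end tags.
    my $out = $tree->look_down( _tag => 'body' )->as_HTML( '<>&', undef, {} );
    $tree->delete;
    return $out;
}

print simplify_html(
    '<p class="MsoNormal"><span style="color:red">Hello <b>world</b></span></p>'
), "\n";
```

The result is still wrapped in a body element, which would need trimming before the fragment goes into the DB. Stripping every attribute is a deliberate simplification here; a real whitelist would probably want to keep things like href on links.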
Re: Simplify HTML programatically by hpavc (Acolyte) on Jun 08, 2006 at 01:15 UTC
I am afraid that to do this right, you would nearly have to transform the HTML fragment that Word generates. It is chock full of divs, last I looked; tons of them are pure waste, and some were integral to basic formatting.
Re: Simplify HTML programatically by Rhandom (Curate) on Jun 08, 2006 at 15:00 UTC
Prebuilt, configurable open source rich text editors abound. Many have content-limiting ability. One that is easy to customize is Moxiecode's TinyMCE. The documentation describes how to limit the output HTML to just those items you want. Yes, it is JavaScript, but if they are cutting and pasting from Word then there is a 99.9% chance that JavaScript is enabled.

my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Simplify HTML programatically by DaWolf (Curate) on Jun 09, 2006 at 04:31 UTC
I have to agree with ww. Since you have "JavaScript Power" in your website it would be relatively simple to prevent "CTRL + C, CTRL + V" and other key combinations. Incidentally, have you tried FCKEditor? AFAIK it's the best on-line HTML editor out there. Just my two cents.

Er Galvão Abbott
www.galvao.eti.br
Porto Alegre Perl Mongers
Re^2: Simplify HTML programatically by Anonymous Monk on Jun 09, 2006 at 21:49 UTC
Agreed, FCKEditor handles WYSIWYG HTML editing from within a web page quite well, and it handles MS Word HTML as well.
Re: Simplify HTML programatically by blahblah (Friar) on Jun 12, 2006 at 06:25 UTC
No one has mentioned HTML::Scrubber? It works well for me.
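For reference, a minimal sketch of what an HTML::Scrubber whitelist could look like for the original question. The tag list and the `scrub_word_html` helper name are assumptions for illustration; the `allow` constructor option and the `scrub` method are taken from the module's documented synopsis. As far as I know, attributes are dropped unless a rule explicitly allows them, but check that against the version you have installed.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Hypothetical whitelist taken from the original question's wish list.
# Tags not listed are stripped, while the text they wrap is kept.
my @allowed = qw(table tr td th b i h1 h2 p br);

my $scrubber = HTML::Scrubber->new( allow => \@allowed );

# Illustrative helper: run pasted markup through the whitelist.
sub scrub_word_html {
    my ($html) = @_;
    return $scrubber->scrub($html);
}

print scrub_word_html(
    '<p class="MsoNormal"><span style="color:red">Hello <b>world</b></span></p>'
), "\n";
```

Unlike the tree-based approaches, this works on a token stream, so it does not repair badly nested markup; it only decides which tags get through.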
Re: Simplify HTML programatically by freddo411 (Chaplain) on Jun 12, 2006 at 20:53 UTC
In a somewhat related vein, if you want to remove some of the special characters that MS Word creates, you can use the following code, which is based on the excellent demoronizer code.

```perl
sub nukeMSsmarts {
    my $s = shift;

    # Map incompatible CP-1252 characters
    $s =~ s/\x82/,/g;
    $s =~ s-\x83-<em>f</em>-g;
    $s =~ s/\x84/,,/g;
    $s =~ s/\x85/.../g;

    $s =~ s/\x88/^/g;
    $s =~ s-\x89- °/°°-g;

    $s =~ s/\x8B/</g;
    $s =~ s/\x8C/Oe/g;

    $s =~ s/\x91/'/g;
    $s =~ s/\x92/'/g;
    $s =~ s/\x93/"/g;
    $s =~ s/\x94/"/g;
    $s =~ s/\x95//g;
    $s =~ s/\x96/-/g;
    $s =~ s/\x97/--/g;
    $s =~ s-\x98-<sup>~</sup>-g;
    $s =~ s-\x99-<sup>TM</sup>-g;

    $s =~ s/\x9B/>/g;
    $s =~ s/\x9C/oe/g;

    # Now check for any remaining untranslated characters.
    $s =~ s/[\x00-\x08\x10-\x1F\x80-\x9F]//g;

    return $s;
}
```

-------------------------------------
Nothing is too wonderful to be true
-- Michael Faraday
Re: Simplify HTML programatically by Anonymous Monk on Nov 26, 2007 at 14:30 UTC
Many people here advocate a strategy that is not what the original question asked for, i.e. deciding which tags to remove. The original question wanted a technique to remove ALL tags, with some exceptions. I found this page while looking for a way to accomplish exactly the same thing.

I've read many warnings against parsing html with regular expressions but for this task, are they still valid?

My predicament is that I'm changing the WYSIWYG editor on an old CMS to FCK. However, for some of the messy old material I need to clean out everything except some tags (the old editor used a strategy of "hugging the text", so there is A LOT of unknown stuff on hundreds and hundreds of pages).

Problem is, my brain is too small to formulate the regular expression needed.

/nic_tester
Re^2: Simplify HTML programatically by wfsp (Abbot) on Nov 26, 2007 at 15:37 UTC
"I've read many warnings against parsing html with regular expressions but for this task, are they still valid?"

You can parse HTML with a regex but, IMO, it's tricky. I always reach for a parser. There are many, and monks recommend different modules. FWIW I tend to stick to HTML::TokeParser::Simple.

Perhaps something like this (it even has a regex):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html_in = do{local $/;<DATA>};

my $p = HTML::TokeParser::Simple->new(\$html_in)
  or die qq{cant parse html\n};

my $html_out;

# Tags to keep. Anchored so that e.g. "span" or "pre" does not slip through
# by containing "p"; the optional slash covers end tags either way.
my $re = qr/^\/?(?:html|head|title|body|p|img)$/;

while (my $t = $p->get_token){
  if (not $t->is_tag()){
    # text and other non-tag tokens pass through untouched
    $html_out .= $t->as_is;
  }
  elsif ($t->is_tag($re)){
    $html_out .= $t->as_is;
  }
}

print qq{$html_out\n};

__DATA__
<html>
  <head>
    <title>title</title>
  </head>
  <body>
    <p>one <b>two</b> <i>three</i></p>
    <p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>
    <p><font>six</font></p>
  </body>
</html>
```

output:

```
<html>
  <head>
    <title>title</title>
  </head>
  <body>
    <p>one two three</p>
    <p><img src="four.gif" alt="img"> five</p>
    <p>six</p>
  </body>
</html>
```

Post a new question if this isn't what you meant or if you want more information.
Re^3: Simplify HTML programatically by Anonymous Monk on Nov 27, 2007 at 14:08 UTC
Many thanks for your reply. The snippet you posted is more or less what I need. However, I don't know Perl, and it looks more server-side than client-side.

My need: I need to clean all HTML tags from a string, with some exception tags. I can only define the exceptions, not the tags to clean. This must be accomplished client-side, preferably in JavaScript, possibly with the JavaScript DOM, even though then I'll be well out of my depth.

Background (read it or not, it's verbose): with the arrival of Windows Vista, the ActiveX WYSIWYG HTML editor on a content management system I'm hosting stopped working. My job was to integrate FCK in place of the ActiveX component. The old component never removed copy-pasted tags from Word etc.; it just "hugged the text" with its own font tags, thus hiding loads of garbage that's still in the DB. FCK cannot support these CSS font tags in a sufficiently user-friendly manner, so in the future I'm using h1, h2, h3 and such together with CSS. However, when a user wishes to edit content that was produced with the old editor, I fret that there might be inconsistencies between h1 (FCK) and font class=r1 (old WYSIWYG). Further, if I exchange the normal text markup of the old editor (font class="f1") for that of FCK (nothing), all the junk that has been copy-pasted into the CMS and then hidden by hug-the-text font tags will suddenly surface. Thus, I want to nuke everything, on user command, except for things like links, line breaks, tables, images, paragraph tags, et cetera.

/nic
Re^4: Simplify HTML programatically by wfsp (Abbot) on Nov 27, 2007 at 14:41 UTC
Re^5: Simplify HTML programatically by nic_tester (Initiate) on Nov 27, 2007 at 14:49 UTC
Re^4: Simplify HTML programatically by Anonymous Monk on Nov 27, 2007 at 14:36 UTC
