http://qs321.pair.com?node_id=554145

guha has asked for the wisdom of the Perl Monks concerning the following question:

I have a web application where the users can add information to a textarea. To make it a bit more spiffy I added a JavaScript application which is a sort of crippled HTML editor. This is all fine and dandy, but then I noticed that the users were cutting and pasting from Word directly. This has two drawbacks.

* The HTML from Word is extremely wordy, heh; a rather simple table can generate megabytes of HTML

* The HTML is too complex for the JavaScript application.

A possible solution to this problem would be if I could somehow simplify the pasted HTML when saving it to the DB.

Something like take this very complex HTML and allow only tables, bold, italic, h1,h2, plain text, ... and remove everything else. This would change the WYSIWYG but not the content.

I have tried using regexps to remove certain specific Word-HTML constructs. This has the advantage of decreasing the size of the resulting HTML but seems to destabilize stuff as well.

Ideas, pointers to CPAN, anyone ?

Replies are listed 'Best First'.
Re: Simplify HTML programatically
by ww (Archbishop) on Jun 07, 2006 at 21:16 UTC
    Demoronizer, tidy, and various other apps have been discussed here.

    Scott7477 is just beginning work on a tut on this general topic (and I'm allegedly helping, though in fact, planetscape has -- IMO -- done most of that by pulling together a formidable collection of refs, cites, etc.) while I may be of little value other than word-butchery.

    GrandFather has posted a WYSIWYG editor in CUFP which may be relevant and helpful. I could undoubtedly name many other relevant resources...

    BUT...

    Were I in your shoes, I would not allow "cut&pasting" of M$word (so-called) .html under any circumstances... for the very good reasons you outline... and
    perhaps even more important, would not allow use of any other than a very small subset of html tags for a whole range of taste-, security- and simplicity-reasons and just plain "old-fogey-ism."

    On the other hand, the Monastery does provide minimally restrictive methods for a visitor to write some .html. Have you explored those methods?

Re: Simplify HTML programatically
by davidrw (Prior) on Jun 07, 2006 at 21:11 UTC
Re: Simplify HTML programatically
by Joost (Canon) on Jun 07, 2006 at 21:37 UTC
Re: Simplify HTML programatically
by trwww (Priest) on Jun 08, 2006 at 06:29 UTC

    I use SAX for this. For me it provides the best performance/maintainability ratio.

    Here is a standalone program that takes an HTML file name as an argument and prints the output to STDOUT.

    Its purpose is to strip all tags except p, div, and a tags. The text inside the stripped tags is kept.

    The program driver sets up the SAX pipeline. It uses XML::Driver::HTML as the SAX driver and XML::SAX::Writer as the writer.

    When parse is called on the driver, the Pipeline sends the driver's stream to some helper modules, to our filter module, and finally the writer.

    In the custom filter, start_element is called for each tag. The code checks to see if the tag is either an a, div, or p tag, and if it is, it forwards the tag to the writer. Otherwise, the tag is ignored and removed from the stream.

    The same work needs to happen in the end_element callback.

    • benefits:
      • Accepts non-well-formed html and outputs well formed xml.
      • Scales well. In theory it should use a constant amount of memory.
      • High level of control of the output document
    • drawbacks:
      • need to learn SAX
      • if you remove a node from the stream in the start_element callback you have to remove the closing tag in the end_element callback (this isn't so much of a "drawback" but more of a reminder that you have to stay on your toes).
    use warnings;
    use strict;

    use XML::SAX::Machines qw(Pipeline);
    use XML::Driver::HTML;
    use XML::Filter::SAX1toSAX2;
    use XML::Filter::BufferText;
    use XML::SAX::Writer;

    my $output; # transformation target
    my $writer = XML::SAX::Writer->new( Output => \$output );

    my $machine = Pipeline(
      'XML::Filter::SAX1toSAX2' =>
      'XML::Filter::BufferText' =>
      'XML::Filter::HtmlTagStripper' =>
      $writer
    );

    my $html = new XML::Driver::HTML(
      Handler => $machine,
      Source  => { SystemId => $ARGV[0] }
    );

    $html->parse();

    print $output;

    package XML::Filter::HtmlTagStripper;
    use base qw|XML::SAX::Base|;

    # <marker language="foo" />
    # $el->{Name} == 'marker'
    # $el->{Attributes}{'{}language'} == language attribute
    # $el->{Attributes}{'{}language'}{Value} == 'foo'

    sub start_element {
      my($self, $el) = @_;
      if ( $el->{Name} =~ m/^(?:p|div|a)$/i ) {
        $self->SUPER::start_element( $el );
      }
    }

    sub end_element {
      my($self, $el) = @_;
      if ( $el->{Name} =~ m/^(?:p|div|a)$/i ) {
        $self->SUPER::end_element( $el );
      }
    }

    1;

    I know it's not the most un-rocket-sciencey thing in the world, but it's not too tricky, and once it clicks in your head and you realize that this is high performance XML parsing, the possibilities are boggling.

    Let's take a look at a run:

    $ cat striptags.html
    <html>
    <head>
    <title>Test Document</title>
    </head>
    <body>
    <p>The first paragraph</p>
    <p>the second paragraph</p>
    <hr width="75%">
    <div>last modified: WHENEVER</div>
    </body>
    </html>
    $ perl striptags.pl striptags.html
    Test Document<p>The first paragraph</p><p>the second paragraph</p><div>last modified: WHENEVER</div>
    Enjoy, Todd W.
      For something of a simpler* solution, but in the same vein, there's HTML::TreeBuilder. HTML::Element provides all of the primitives that you really need for an operation like this: look_down to identify relevant elements, replace_with_content to "remove" a tag without removing what it contains, and delete to completely destroy all signs of a given element. I'm not up to writing an example right now, but it's truly simple. Give it a shot! It goes a long way, and the output is bound to be less of a mess than the input.

      * edit: okay, I realized that some might be confused by this usage of simple, since trwww's example is pretty simple in itself. Mostly it's a matter of being allowed to think in terms of tree manipulations instead of opens and closes and stacking and de-stacking. The corresponding cost is in storage, but it's usually not worrisome.
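
      Since no example was posted for the tree approach, here is a rough sketch of how it might look. The whitelist, the helper name simplify_html, and the choice to drop every attribute except href are assumptions made for illustration, not anything from the posts above. It uses only the HTML::Element primitives already named: look_down, replace_with_content and delete.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical whitelist of tags that survive. html/head/body stay
# because TreeBuilder adds them and the root cannot be unwrapped.
my %keep = map { $_ => 1 }
    qw(html head body p a b i h1 h2 table tr td th);

# Hypothetical helper, not part of any module.
sub simplify_html {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);

    # Throw away script/style together with whatever they contain.
    $_->delete for $tree->look_down( _tag => qr/^(?:script|style)$/ );

    # Unwrap every other non-whitelisted element, keeping its children.
    for my $el ( $tree->look_down( sub { !$keep{ $_[0]->tag } } ) ) {
        $el->replace_with_content->delete;
    }

    # Drop all attributes except href on links (assumed policy).
    for my $el ( $tree->look_down( sub { 1 } ) ) {
        for my $name ( $el->all_external_attr_names ) {
            next if $el->tag eq 'a' && $name eq 'href';
            $el->attr( $name, undef );
        }
    }

    # Empty hashref: no end tag is optional, so all are written out.
    my $out = $tree->as_HTML( '<>&', undef, {} );
    $tree->delete;
    return $out;
}

print simplify_html(
    '<p class="MsoNormal"><span style="color:red">Hello <b>bold</b></span>'
  . ' <font face="Arial">world</font></p><script>alert(1)</script>'
), "\n";
```

      The whole document is held in memory, which is the storage cost mentioned above, but there is no open/close bookkeeping to get wrong.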

Re: Simplify HTML programatically
by hpavc (Acolyte) on Jun 08, 2006 at 01:15 UTC
    I am afraid that to do this right, you would have to nearly transform the HTML fragment that Word generates. It is chock-full of divs, last I looked: some represent tons of waste and some were integral to basic formatting.
Re: Simplify HTML programatically
by Rhandom (Curate) on Jun 08, 2006 at 15:00 UTC
    Prebuilt, configurable open source rich text editors abound. Many have content-limiting ability. One that is easy to customize is Moxiecode's TinyMCE. The documentation describes how to limit the output HTML to just those items you want.

    Yes - it is javascript - but if they are cutting and pasting from Word then there is a 99.9% chance that javascript is enabled.
    my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Simplify HTML programatically
by DaWolf (Curate) on Jun 09, 2006 at 04:31 UTC
    I have to agree with ww. Since you have "JavaScript Power" in your website it would be relatively simple to prevent "CTRL + C, CTRL + V" and other key combinations.

    Incidentally, have you tried FCKEditor? AFAIK it's the best HTML on-line editor out there.

    Just my two cents.
      agreed, FCKEditor handles WYSIWYG html editing from within a webpage quite well, and it does handle MSWord html as well.
Re: Simplify HTML programatically
by blahblah (Friar) on Jun 12, 2006 at 06:25 UTC
Re: Simplify HTML programatically
by freddo411 (Chaplain) on Jun 12, 2006 at 20:53 UTC
    In a somewhat related vein, if you want to remove some of the special characters that MSword creates you can use the following code which is based on the excellent demoronizer code.
    sub nukeMSsmarts {
        my $s = shift;

        # Map incompatible CP-1252 characters
        $s =~ s/\x82/,/g;
        $s =~ s-\x83-<em>f</em>-g;
        $s =~ s/\x84/,,/g;
        $s =~ s/\x85/.../g;

        $s =~ s/\x88/^/g;
        $s =~ s-\x89- °/°°-g;
        $s =~ s/\x8B/</g;
        $s =~ s/\x8C/Oe/g;

        $s =~ s/\x91/'/g;
        $s =~ s/\x92/'/g;
        $s =~ s/\x93/"/g;
        $s =~ s/\x94/"/g;
        $s =~ s/\x95/*/g;
        $s =~ s/\x96/-/g;
        $s =~ s/\x97/--/g;
        $s =~ s-\x98-<sup>~</sup>-g;
        $s =~ s-\x99-<sup>TM</sup>-g;

        $s =~ s/\x9B/>/g;
        $s =~ s/\x9C/oe/g;

        # Now check for any remaining untranslated characters.
        $s =~ s/[\x00-\x08\x10-\x1F\x80-\x9F]/*/g;

        return $s;
    }
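
    A different take on the same problem, offered only as a sketch: if the pasted text really arrives as raw Windows-1252 bytes (an assumption; it will mangle input that is already UTF-8 or already decoded), the core Encode module can map those bytes to the proper Unicode characters, which can then be written as numeric entities instead of ASCII look-alikes. The helper name cp1252_to_entities is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw CP-1252 bytes, returns ASCII-only text
# in which every non-ASCII character is a numeric HTML entity.
sub cp1252_to_entities {
    my ($bytes) = @_;

    # Assumes $bytes is an undecoded CP-1252 byte string.
    my $text = decode( 'cp1252', $bytes );

    # Replace each character above 0x7F with &#NNNN;
    $text =~ s/([^\x00-\x7F])/sprintf( '&#%d;', ord($1) )/ge;

    return $text;
}

# Curly quotes, an em dash and a trademark sign as Word would paste them
print cp1252_to_entities("\x93Hi\x94 \x97 5\x99"), "\n";
# prints: &#8220;Hi&#8221; &#8212; 5&#8482;
```

    Unlike the substitution table above, this keeps the typographic characters rather than flattening them, so which one fits depends on whether you want plain ASCII or faithful punctuation.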

    -------------------------------------
    Nothing is too wonderful to be true
    -- Michael Faraday

Re: Simplify HTML programatically
by Anonymous Monk on Nov 26, 2007 at 14:30 UTC

    Many people here advocate a strategy that does not fit the original question, i.e. deciding which tags to remove. The original question wanted a technique to remove ALL tags with some exceptions. I found this page while looking for a way to accomplish exactly the same thing.

    I've read many warnings against parsing HTML with regular expressions, but for this task, are they still valid?

    My predicament is that I'm changing the WYSIWYG editor on an old CMS to FCK. However, for some of the messy old material I need to clean out everything except some tags (the old editor used a strategy of "hugging the text", so there is A LOT of unknown stuff on hundreds and hundreds of pages).

    Problem is, my brain is too small to formulate the regular expression needed.

    /nic_tester
      I've read many warnings against parsing HTML with regular expressions, but for this task, are they still valid?
      You can parse HTML with a regex but, IMO, it's tricky. I always reach for a parser. There are many, and monks recommend different modules. FWIW, I tend to stick to HTML::TokeParser::Simple.

      Perhaps something like this (it even has a regex):

      #!/usr/bin/perl
      use strict;
      use warnings;

      use HTML::TokeParser::Simple;

      my $html_in = do{local $/;<DATA>};

      my $p = HTML::TokeParser::Simple->new(\$html_in)
        or die qq{cant parse html\n};

      my $html_out;
      my $re = qr/html|head|title|body|p|img/;
      while (my $t = $p->get_token){
        if (not $t->is_tag()){
          $html_out .= $t->as_is;
        }
        elsif ($t->is_tag($re)){
          $html_out .= $t->as_is;
        }
      }

      print qq{$html_out\n};

      __DATA__
      <html>
      <head>
      <title>title</title>
      </head>
      <body>
      <p>one <b>two</b> <i>three</i></p>
      <p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>
      <p><font>six</font></p>
      </body>
      </html>
      output:
      <html>
      <head>
      <title>title</title>
      </head>
      <body>
      <p>one two three</p>
      <p><img src="four.gif" alt="img"> five</p>
      <p>six</p>
      </body>
      </html>
      Post a new question if this isn't what you meant or if you want more information.

        Many thanx for your reply.

        Well, the snippet you posted is more or less what I need. However, I don't know Perl, and it looks more server-side than client-side. My need:

        I need to clean all HTML tags from a string, with some exception tags. I can only define the exceptions, not the tags to clean. This must be accomplished client-side, preferably in JavaScript, possibly with the JavaScript DOM, even though then I'll be well out of my depth.

        Background (read it or not, it's verbose): with the arrival of Windows Vista, the ActiveX WYSIWYG HTML editor on a content management system I'm hosting stopped working. My job was to integrate FCK in place of the ActiveX component.

        The old component never removed copy-pasted tags from Word etc.; it just "hugged the text" with its own font tags, thus hiding loads of garbage that's still in the DB. FCK cannot support these CSS font tags in a sufficiently user-friendly manner, so in the future I'm using h1, h2, h3 and such together with CSS. However, when a user wishes to edit stuff that was produced with the old editor, I fret that there might be inconsistencies between h1 (FCK) and font class=r1 (old WYSIWYG).

        And, further, if I exchange the normal text markup in the old editor (font class="f1") with that of FCK (nothing), all the junk that has been copy-pasted into the CMS and then been hidden by hug-the-text font tags will suddenly surface. Thus, I want to nuke everything, on user command, except for stuff like links, line breaks, tables, images, paragraph tags, etcetera.

        /nic