http://qs321.pair.com?node_id=653006


in reply to Re: Simplify HTML programatically
in thread Simplify HTML programatically

I've read many warnings against parsing HTML with regular expressions, but are they still valid for this task?
You can parse HTML with a regex but, IMO, it's tricky. I always reach for a parser. There are many, and monks recommend different modules. FWIW, I tend to stick to HTML::TokeParser::Simple.

Perhaps something like this (it even has a regex):

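#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Slurp the sample document from the __DATA__ section.
my $html_in = do { local $/; <DATA> };

my $p = HTML::TokeParser::Simple->new(\$html_in)
    or die qq{can't parse html\n};

my $html_out = '';

# Tags to keep. The pattern is anchored so that "p" does not also match
# "pre" or "span"; the optional slash is a precaution in case end-tag
# names are reported with a leading "/".
my $re = qr{^/?(?:html|head|title|body|p|img)$};

while (my $t = $p->get_token) {
    if (not $t->is_tag()) {
        # Not a tag (text, comment, ...): keep it unchanged.
        $html_out .= $t->as_is;
    }
    elsif ($t->is_tag($re)) {
        # A start or end tag on the keep list: keep it unchanged.
        $html_out .= $t->as_is;
    }
    # Any other tag is dropped; the text it encloses is kept.
}
print qq{$html_out\n};

__DATA__
<html>
<head>
<title>title</title>
</head>
<body>
<p>one <b>two</b> <i>three</i></p>
<p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>
<p><font>six</font></p>
</body>
</html>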
output:
<html>
<head>
<title>title</title>
</head>
<body>
<p>one two three</p>
<p><img src="four.gif" alt="img"> five</p>
<p>six</p>
</body>
</html>
Post a new question if this isn't what you meant or if you want more information.

Re^3: Simplify HTML programatically
by Anonymous Monk on Nov 27, 2007 at 14:08 UTC

    Many thanks for your reply.

    Well, the snippet you posted is more or less what I need. However, I don't know Perl, and it looks more server-side than client-side. My need:

    I need to strip all HTML tags from a string, apart from a few exception tags. I can only define the exceptions, not the tags to clean. This must be done client-side, preferably in JavaScript, possibly with the JavaScript DOM, even though then I'll be well out of my depth.

    Background (read it or not, it's verbose): with the move to Windows Vista, the ActiveX WYSIWYG HTML editor on a content management system I'm hosting stopped working. My job was to integrate fck in place of the ActiveX component.

    The old component never removed tags copy-pasted from Word etc.; it just "hugged the text" with its own font tags, thus hiding loads of garbage that's still in the db. fck can't support these CSS font tags in a sufficiently user-friendly manner, so in future I'm using h1, h2, h3 and such together with CSS. However, when a user wants to edit stuff that was produced with the old editor, I fret that there might be inconsistencies between h1 (fck) and font class=r1 (old WYSIWYG).

    And further, if I swap the normal text markup of the old editor (font class="f1") for that of fck (nothing), all the junk that has been copy-pasted into the CMS and then hidden by the hug-the-text font tags will suddenly surface. Thus, I want to nuke everything, on user command, except for stuff like links, line breaks, tables, images, paragraph tags, etcetera.

    /nic
      The snippet could be written to run on the client or the server.

      JavaScript: I can't help you there, I'm afraid. Are you saying you want to do this in the browser?

      If what you want to do is rewrite a lot of HTML I think we'll be able to help.

      Your best bet is to post a new question with a representative (but fairly short) example of your 'junk' and an example of what you want it to look like.

        OK, will do, thanks for the info. The problem with the junk is that I haven't got a clue what it is. It's hundreds of users with next to no knowledge who have been at it for years, putting up HTML pages any which way they could ram them through the old WYSIWYG component, so I expect the "junk" is anything and everything.
      Just to clarify, it's the tags themselves I want to remove, not the text they enclose.
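For the client-side requirement described in this thread (keep a short list of exception tags, drop every other tag, keep the enclosed text), a rough JavaScript sketch might look like the following. Nothing in it comes from the thread: the helper name `stripTags`, the sample string and the keep list are made up for illustration. It also relies on a single regex rather than a real parser, so it is likely to trip over comments, script or style content, and malformed markup such as unclosed quotes; in a browser, walking the DOM would probably be more robust.

```javascript
// Hypothetical helper (not from the thread): remove every tag whose name is
// not in `allowed`, keeping the text between the tags.
function stripTags(html, allowed) {
  const keep = new Set(allowed.map((name) => name.toLowerCase()));
  // "<", optional "/", a tag name, then attributes. Quoted attribute values
  // are consumed whole so that a ">" inside quotes does not end the tag early.
  const tag = /<\/?([A-Za-z][\w:-]*)(?:"[^"]*"|'[^']*'|[^'">])*>/g;
  return html.replace(tag, (match, name) =>
    keep.has(name.toLowerCase()) ? match : ''
  );
}

// Assumed sample input, loosely modelled on the Perl example's data.
const sample =
  '<p>one <b>two</b> <i>three</i></p>' +
  '<p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>' +
  '<p><font class="f1">six</font></p>';

console.log(stripTags(sample, ['p', 'img']));
// prints: <p>one two three</p><p><img src="four.gif" alt="img"> five</p><p>six</p>
```

Because the tag name is compared as a whole against the keep list, allowing `p` does not accidentally allow `pre` or `param`, and Word-style tags such as `<o:p>` are dropped like any other tag that is not listed.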