http://qs321.pair.com?node_id=652992


in reply to Simplify HTML programatically

Many people here advocate a strategy that does not fit the original question, i.e. deciding which tags to remove. The original question asked for a technique to remove ALL tags, with some exceptions. I found this page while looking for a way to accomplish exactly the same thing.

I've read many warnings against parsing HTML with regular expressions, but are they still valid for this task?

My predicament is that I'm changing the WYSIWYG editor on an old CMS to FCKeditor. However, for some of the messy old material I need to clean out everything except some tags (the old editor used a strategy of "hugging the text", so there is A LOT of unknown stuff on hundreds and hundreds of pages).

The problem is, my brain is too small to formulate the regular expression needed.

/nic_tester

Re^2: Simplify HTML programatically
by wfsp (Abbot) on Nov 26, 2007 at 15:37 UTC
    I've read many warnings against parsing HTML with regular expressions, but are they still valid for this task?
    You can parse HTML with a regex but, IMO, it's tricky. I always reach for a parser. There are many, and monks recommend different modules. FWIW, I tend to stick to HTML::TokeParser::Simple.

    Perhaps something like this (it even has a regex):

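    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # Slurp the sample document from the __DATA__ section.
    my $html_in = do { local $/; <DATA> };

    my $p = HTML::TokeParser::Simple->new(\$html_in)
        or die qq{can't parse html\n};

    my $html_out;

    # Tags to keep. Anchored so that "p" cannot also match e.g. "span";
    # the optional "/" covers end-tag names reported with a leading slash.
    my $re = qr{^/?(?:html|head|title|body|p|img)$};

    while (my $t = $p->get_token) {
        if (not $t->is_tag) {
            $html_out .= $t->as_is;    # not a tag (text etc.): keep as is
        }
        elsif ($t->is_tag($re)) {
            $html_out .= $t->as_is;    # a tag on the keep list
        }
        # any other tag is dropped; the text it enclosed is kept
    }
    print qq{$html_out\n};
    __DATA__
    <html>
    <head>
    <title>title</title>
    </head>
    <body>
    <p>one <b>two</b> <i>three</i></p>
    <p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>
    <p><font>six</font></p>
    </body>
    </html>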
    output:
    <html>
    <head>
    <title>title</title>
    </head>
    <body>
    <p>one two three</p>
    <p><img src="four.gif" alt="img"> five</p>
    <p>six</p>
    </body>
    </html>
    Post a new question if this isn't what you meant or if you want more information.
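
    For comparison, and to show why the warnings about regexes are still worth heeding here, this is a rough sketch of a regex-only version that uses nothing beyond core Perl. The helper name strip_tags_except and the sample input are made up for illustration. It assumes that no attribute value contains a ">" and that the input has no comments, scripts or CDATA sections; real Word-pasted markup may well break those assumptions, which is the tricky part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every tag whose name is not in @keep,
# leaving the text between the tags untouched.
sub strip_tags_except {
    my ($html, @keep) = @_;
    my %keep = map { lc($_) => 1 } @keep;

    # Match "<", an optional "/", a tag name, then everything up to the
    # next ">". ASSUMPTION: no ">" inside attribute values, no comments.
    $html =~ s{(</?([a-zA-Z][a-zA-Z0-9:]*)[^>]*>)}{ $keep{lc $2} ? $1 : '' }ge;

    return $html;
}

my $in  = '<p>one <b>two</b> <font class="f1">three</font> '
        . '<a href="five.html">five</a></p>';
my $out = strip_tags_except($in, qw(p a br img table tr td));
print "$out\n";    # prints: <p>one two three <a href="five.html">five</a></p>
```

    Only the list of tags to keep is specified; everything else is removed without having to be named. An attribute such as alt="a > b" would cut the match short and leave debris behind, which a parser handles without special effort.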

      Many thanks for your reply.

      Well, the snippet you posted is more or less what I need. However, I don't know Perl, and it looks more server-side than client-side. My need:

      I need to clean all HTML tags from a string, with some exception tags. I can only define the exceptions, not the tags to clean. This must be accomplished client-side, preferably in JavaScript, possibly with the JavaScript DOM, even though I'll then be well out of my depth.

      Background (read it or not, it's verbose): with the move to Windows Vista, the ActiveX WYSIWYG HTML editor on a content management system I'm hosting stopped working. My job was to integrate FCKeditor in place of the ActiveX component.

      The old component never removed tags copy-pasted from Word etc.; it just "hugged the text" with its own font tags, thus hiding loads of garbage that's still in the DB. FCKeditor cannot support these CSS font tags in a sufficiently user-friendly manner, so in the future I'm using h1, h2, h3 and such together with CSS. However, when a user wishes to edit stuff that was produced with the old editor, I fret that there might be inconsistencies between h1 (FCKeditor) and font class=r1 (old WYSIWYG).

      Furthermore, if I exchange the normal text markup of the old editor (font class="f1") for that of FCKeditor (nothing), all the junk that has been copy-pasted into the CMS and then hidden by the hug-the-text font tags will suddenly surface. Thus, I want to nuke everything, on user command, except for stuff like links, line breaks, tables, images, paragraph tags, etc.

      /nic
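
      The requirement above (keep links, line breaks, tables, images and paragraphs, drop every other tag, keep the text) might be sketched with the same parser as follows. This is Perl, so it would run on the server or as a one-off clean-up script rather than in the browser. The tag list and the helper name keep_only_listed_tags are assumptions made for illustration, not a tested rule set for this CMS.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Tags to keep. ASSUMPTION: a guess at "links, line breaks, tables,
# images, paragraph tags"; extend it to suit the real pages.
my %keep = map { $_ => 1 } qw(a br p img table thead tbody tr th td);

# Hypothetical helper: return $html with every tag outside %keep removed.
sub keep_only_listed_tags {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die "can't parse html\n";

    my $out = '';
    while (my $t = $p->get_token) {
        if ($t->is_tag) {
            # Normalise the name: lower case, no leading slash on end tags.
            (my $name = lc $t->get_tag) =~ s{^/}{};
            next unless $keep{$name};    # unknown tag: drop it
        }
        $out .= $t->as_is;    # kept tags and all non-tag tokens pass through
    }
    return $out;
}

print keep_only_listed_tags(
    '<p><font class="f1">old <span style="x">text</span></font><br>'
  . '<a href="five.html">five</a></p>'
), "\n";
```

      The hash lookup matches tag names exactly, so keeping p does not accidentally keep span or pre. Comments are not tags and pass through unchanged, so any that were pasted in with the old content would need separate handling.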
        The snippet could be written to run on the client or the server.

        Javascript: I can't help you there I'm afraid. Are you saying you want to do this in the browser?

        If what you want to do is rewrite a lot of HTML I think we'll be able to help.

        Your best bet is to post a new question with a representative (but fairly short) example of your 'junk' and an example of what you want it to look like.

        Just to clarify, it's the tags themselves I want to remove, not the text they enclose.