PerlMonks  

Fixing Bad HTML

by Cody Pendant (Prior)
on Nov 16, 2002 at 23:17 UTC ( [id://213469]=perlquestion )

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I do some work on a board where some HTML is allowed, for instance bold and italic and font colours.

Every so often someone posts some HTML that is missing an end tag, and then, you know, the whole rest of the page is pink or something.

So I and another coder want to make some kind of script to count opening and closing tags and add the missing ones where necessary.

It's more or less a theoretical question here, because the board is written in PHP, but, what structure would monks use to try and figure out that the post had one or more opening tags which were unclosed, and to try to close them in the correct order?

Hashes, arrays, AoH, HoA? -- I'm not seeing any neat data structure in my head.
--

($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: Fixing Bad HTML
by Chmrr (Vicar) on Nov 17, 2002 at 03:07 UTC

    HTML::TreeBuilder does a good job of finding and closing such problems when it parses, as well as adding some implicit tags that get forgotten. The following one-liner should be enough to get you started:

    perl -MHTML::TreeBuilder -ne 'print map {ref $_ ? $_->as_HTML : $_} HTML::TreeBuilder->new_from_content($_) ->look_down(_tag=>"body")->content_list'

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: Fixing Bad HTML
by chromatic (Archbishop) on Nov 16, 2002 at 23:49 UTC
Re: Fixing Bad HTML
by pg (Canon) on Nov 17, 2002 at 00:02 UTC
    Use a stack. When you see an open tag, push it onto the stack. When you see a close tag, compare it with the last element on the stack: if it matches, pop it; otherwise deal with the error. If the tag is self-closed, either don't push it, or push it and then pop it, depending on how you treat the content.
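A minimal sketch of the stack approach described above, in its strict form. The function name `close_open_tags`, the `%allowed` whitelist, and the regex-based tokenizing are assumptions made for illustration, not anything the board actually uses; a real parser such as HTML::TreeBuilder copes with far more malformed input than this does.

```perl
use strict;
use warnings;

# Hypothetical whitelist of tags the board permits; adjust to taste.
my %allowed = map { $_ => 1 } qw(b i u font);

# Returns ($fixed_html, @errors). Tags still open at the end are closed
# innermost first. A close tag that does not match the top of the stack
# is reported as an error and dropped (the "strict" choice).
sub close_open_tags {
    my ($html) = @_;
    my (@stack, @errors);
    my $out = '';

    # The capturing group makes split return the tags as well as the text.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ m{^<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)\s*>$}) {
            my ($closing, $name, $self_closed) = ($1, lc($2), $3);
            if (!$allowed{$name} || $self_closed) {
                $out .= $piece;              # untracked or self-closed: pass through
            }
            elsif (!$closing) {
                push @stack, $name;          # open tag
                $out .= $piece;
            }
            elsif (@stack && $stack[-1] eq $name) {
                pop @stack;                  # close tag matching the top of the stack
                $out .= $piece;
            }
            else {
                push @errors, "unexpected </$name>";
            }
        }
        else {
            $out .= $piece;                  # plain text
        }
    }

    # Whatever is left on the stack was never closed.
    $out .= "</$_>" for reverse @stack;
    return ($out, @errors);
}

my ($fixed, @errors) = close_open_tags('<b>bold <i>both');
print "$fixed\n";    # <b>bold <i>both</i></b>
```

An array used with push and pop is all the data structure this needs; the order of the missing closers falls out of reversing the stack.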
      Thanks for that. That's a structure at least. But what if the thing to be closed isn't the last item in the stack, like if someone's crossed over tags:
      blah blah <B>blah blah<I> blah blah</B></I>
      which is bad HTML, but not problematic in this context?
      --
      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

        Either do as sauoq says and just don't create a misfeature, or jump right in and do the beastly thing yourself (you do no one any favors by enabling bad behaviour). If you do go ahead, you could also keep track of how many tags have been opened and be sure to close them before the end of your user-accessible section.
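The counting idea can be sketched with a plain hash of per-tag counts. The helper name `append_missing_closers` and its argument list are hypothetical. Note the limitation: counting alone does not record nesting order, so the closers come out in the order the tag names were passed in, which is usually tolerable for inline formatting tags but is not guaranteed to nest correctly.

```perl
use strict;
use warnings;

# Hypothetical helper: for each tracked tag, count opens minus closes
# and append that many closing tags to the end of the post.
sub append_missing_closers {
    my ($html, @track) = @_;
    my $out = $html;

    for my $tag (@track) {
        # "() = ..." forces list context, so the scalar gets the match count.
        my $opens  = () = $html =~ /<\s*\Q$tag\E\b[^>]*>/gi;
        my $closes = () = $html =~ m{<\s*/\s*\Q$tag\E\s*>}gi;
        my $missing = $opens - $closes;
        $out .= "</$tag>" x $missing if $missing > 0;
    }
    return $out;
}

print append_missing_closers('<font color="pink">hi <b>there', 'b', 'font'), "\n";
# <font color="pink">hi <b>there</b></font>
```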

        __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;

        "which is bad HTML, but not problematic in this context?"

        Opt for being strict. Disallow such crappy markup.

        -sauoq
        "My two cents aren't worth a dime.";
        
        Though I'd be inclined to disallow sloppy markup like this (as others have suggested), one option I've used in the past is to backtrack up the stack looking for a matching tag and autoclosing any open tags I pass along the way.

        In this case that would proceed something like this. You get to the </B> and look at the tag at the top of the stack. It's not a <B>, it's an <I>, so you generate a </I> yourself and pop that off the stack, then try again. This time it is a <B> so you can just pop it off the top and you move on.

        The next closing tag is </I>. Since there's no matching open tag on the stack, you simply remove it.
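The backtracking walk-through above might look roughly like this. It is only a sketch: `repair_nesting` and the `%tracked` whitelist are names invented here, and the regex tokenizing assumes reasonably simple tags.

```perl
use strict;
use warnings;

# Hypothetical whitelist; untracked tags pass through untouched.
my %tracked = map { $_ => 1 } qw(b i u font);

sub repair_nesting {
    my ($html) = @_;
    my @stack;
    my $out = '';

    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ m{^<\s*/\s*(\w+)\s*>$} && $tracked{lc $1}) {
            my $name = lc $1;
            if (grep { $_ eq $name } @stack) {
                # Backtrack up the stack, auto-closing every tag that was
                # opened after the one being closed.
                $out .= '</' . pop(@stack) . '>' while $stack[-1] ne $name;
                pop @stack;
                $out .= $piece;
            }
            # Otherwise it is a stray closer with no open tag: drop it.
        }
        elsif ($piece =~ m{^<\s*(\w+)[^>]*>$} && $tracked{lc $1}) {
            my $name = lc $1;
            # Self-closed tags (<b/>) are emitted but never pushed.
            push @stack, $name unless $piece =~ m{/\s*>$};
            $out .= $piece;
        }
        else {
            $out .= $piece;    # text, or a tag we do not track
        }
    }

    # Close anything still open at the end of the post.
    $out .= '</' . pop(@stack) . '>' while @stack;
    return $out;
}

print repair_nesting('blah blah <B>blah blah<I> blah blah</B></I>'), "\n";
# blah blah <B>blah blah<I> blah blah</i></B>
```

The generated `</i>` is lowercase because the stack stores normalized tag names; the stray `</I>` at the end is removed, as in the walk-through.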

                $perlmonks{seattlejohn} = 'John Clyman';
