Ensuring HTML is "balanced"

by skx (Parson)
on Mar 06, 2006 at 20:22 UTC ( [id://534748]=perlquestion )

skx has asked for the wisdom of the Perl Monks concerning the following question:

I'm interested in modifying user-submitted HTML, such that all tags are balanced.

e.g. "<b><i>test</b>" is obviously broken HTML.

I realise I can do simple cases with regexps, but to do it properly I probably want to use HTML::TreeBuilder, or similar.

The problem is I'm not 100% sure how to start. I can certainly keep a stack of opened tags, and know when something is broken. But pushing the closures on in the right order is a bit tricky.

Surprisingly, CPAN didn't seem to have anything to offer when I searched for terms such as 'html balance', so if there is existing code I've not found it.

Steve
--

Replies are listed 'Best First'.
Re: Ensuring HTML is "balanced"
by GrandFather (Saint) on Mar 06, 2006 at 21:06 UTC

    HTML::TreeBuilder is a good answer. It is pretty tolerant of missing close tags and can generate nice HTML output if you ask it nicely. You may also be interested in HTML::Lint, which parses HTML and generates an error report.

    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use HTML::Lint;

    my $html = do {local $/; (<DATA>)};
    my $lint = HTML::Lint->new (only_types => HTML::Lint::Error::STRUCTURE);

    $lint->parse ($html);
    $lint->eof ();
    print "HTML::Lint report:\n";
    print join "\n", map {$_->as_string ()} $lint->errors ();

    my $tree = HTML::TreeBuilder->new ();
    $tree->parse ($html);
    $tree->eof ();
    print "\n\nTreeBuilder cleaned up HTML\n";
    print $tree->as_HTML ();

    __DATA__
    <p><b><i>test</b></p>

    Prints:

    HTML::Lint report:
    (1:14) <i> at (1:7) is never closed
    (1:18) <body> tag is required
    (1:18) <head> tag is required
    (1:18) <html> tag is required
    (1:18) <title> tag is required

    TreeBuilder cleaned up HTML
    <html><head></head><body><p><b><i>test</i></b></body></html>

    DWIM is Perl's answer to Gödel
      ...So since the cleaned-up HTML in fact has a broken p tag, are we free to assume that the Lint report *and* TreeBuilder handle P tags in an amusing manner?

        Actually HTML doesn't require that some tags (including p tags) be closed. In particular the HTML 4.01 specification in section 9.3.1 says:

        Paragraphs: the P element
        Start tag: required, End tag: optional

        so strictly speaking the p tag is not broken.
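If explicit `</p>` tags are wanted anyway, one option is sketched below. It assumes HTML::TreeBuilder is installed and relies on the documented third argument of `as_HTML`, a hash reference naming the tags whose end tags may be omitted; an empty hash reference means no end tag is treated as optional. The helper name `close_all_tags` is made up for this illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# close_all_tags is a hypothetical helper name, not part of HTML::TreeBuilder.
sub close_all_tags {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;
    # Arguments: entities to encode (undef = default), indent string
    # (undef = no indenting), and the set of optional end tags.
    # An empty hash ref asks for every end tag to be written out.
    my $out = $tree->as_HTML(undef, undef, {});
    $tree->delete;    # free the tree explicitly
    return $out;
}

print close_all_tags('<p><b><i>test</b></p>');
```

With the default (no third argument), p, li, dt and dd are among the tags whose end tags are left out, which is why the earlier output has no `</p>`.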


        Perl is environmentally friendly - it saves trees
        This is a common misconception (and one which I think reflects what the standard *SHOULD* be). Even though the close </p> tag is mandatory for certain other standards, </p> is optional in html 4.01, per http://www.w3.org/TR/html401/struct/text.htm and other w3c references:
        9.3.1 Paragraphs: the P element
        ...
        Start tag: required, End tag: optional

        Whether or not this stands in the forthcoming html 5.0 standard is unknown.

Re: Ensuring HTML is "balanced"
by webfiend (Vicar) on Mar 07, 2006 at 00:20 UTC

    If your solution doesn't absolutely, positively have to be in Perl, then maybe a call to HTML Tidy would be the appropriate solution. It takes broken input and does its best to send it back to you as clean HTML. I use it in a lot of my projects.

    Another solution I use is to not allow HTML formatting in user input, but maybe you don't want to force your users into learning one of the various HTML shorthand languages. Still, using Text::Textile to generate HTML from user input might be a little easier than making sure your users are always creating correct markup on their own.

Re: Ensuring HTML is "balanced"
by spiritway (Vicar) on Mar 06, 2006 at 20:37 UTC

    I'm not sure if this is what you want, but, using search terms "HTML tags" I found a couple of possibilities: HTML::TagUtil, and HTML::EasyTags. While they may not do exactly what you want, they may provide information about the tags for you to make the desired corrections to your input.

    The problem, of course, is that it's often not possible to know exactly where the tags were intended. For example: <b>Which <i>exact text was supposed to be italics?</b>. Where does the </i> tag go? In more complex text, it is likely that only the original author knows just where the tags should have been placed - if, in fact, s/he even knows.

      It's true that knowing where the tag should be closed can be a tricky guesstimation at best in many circumstances. There's a simple way to decide where to put the closing tag when in doubt, though: just stick it in the last possible place to have it nest properly. Thus, in your example, the </i> would be placed just before the </b>, like so:

      <b>Which <i>exact text was supposed to be italics?</i></b>

      While this may not give you exactly what the original poster intended, it does help to get your code to validate properly.

      print substr("Just another Perl hacker", 0, -2);
      - apotheon
      CopyWrite Chad Perrin

Re: Ensuring HTML is "balanced"
by ambrus (Abbot) on Mar 06, 2006 at 20:48 UTC
Re: Ensuring HTML is "balanced"
by rvosa (Curate) on Mar 07, 2006 at 04:27 UTC
      I have used this before with much success. The single biggest problem HTML::Tidy has is its inability to rip out language-specific syntax, e.g. ASP.


      Evan Carroll
      www.EvanCarroll.com
Re: Ensuring HTML is "balanced"
by insaniac (Friar) on Mar 07, 2006 at 08:56 UTC
    hm... do you know Text::Balanced (a great module by Damian)?

    to ask a question is a moment of shame
    to remain ignorant is a lifelong shame

      Did you even bother to read the question? Text::Balanced parses strings with nested paired delimiters. Apart from the fact that it has “balanced” in its name, it has nothing to do with the OP’s problem.

      What baffles me even more is that >20 people upvoted this vacuous suggestion, for whatever reason.

      Makeshifts last the longest.

        I did bother... and IMVHO I think it could help out the OP.

        Ok, I didn't post *the* solution, but you probably also know there's always more than one way to solve a problem in Perl. This one isn't maybe the easiest, or fastest, or most elegant one... but I still think it could help.

        btw: you can always downvote me if you don't like what I said ;)

        pussy ass code to proof (I hate proofing) I did bother:

        to ask a question is a moment of shame
        to remain ignorant is a lifelong shame

Re: Ensuring HTML is "balanced"
by DrHyde (Prior) on Mar 08, 2006 at 09:41 UTC
    Actually, fixing that is pretty easy. Yes, you keep a stack of opening tags that you're within, and then whenever you find a closing tag which doesn't match the top of the stack, you need to add the right closing tag and then try again - and again, and again, and again, until everything's back in sync. You also need to handle the end of the document correctly so that you automagically close anything left on the stack.

    With a little more trickery (but only a little) you can take account of stuff like <HR> and <BR> not needing to be closed, that only certain tags are legal immediately inside others (eg <TR> is legal inside <TABLE> but <HR> isn't) and so on.

    Even if you ignore all those special cases, you'll have a good solution.
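The stack approach described above can be sketched roughly as follows. This is a minimal illustration in core Perl, not a replacement for a real parser: it tokenizes with a regexp (so a `>` inside a comment or attribute value will confuse it), the `%VOID` list is partial, it does not check which tags may legally nest inside others, and `balance_tags` is a made-up name. A closing tag with no matching opener on the stack is simply dropped, which is one possible policy among several.

```perl
use strict;
use warnings;

# Elements that never take a closing tag (a partial, illustrative list).
my %VOID = map { $_ => 1 } qw(br hr img input meta link);

# balance_tags is a hypothetical helper, not part of any CPAN module.
sub balance_tags {
    my ($html) = @_;
    my @open;       # stack of currently open tag names
    my $out = '';

    # Split into tags and text; the capture group keeps the tags in the list.
    for my $token (split /(<[^>]*>)/, $html) {
        next if $token eq '';
        if ($token =~ m{^<\s*/\s*([A-Za-z][A-Za-z0-9]*)}) {      # closing tag
            my $name = lc $1;
            # A closer with no matching opener on the stack is dropped.
            next unless grep { $_ eq $name } @open;
            # Close everything opened since the matching opener, then it.
            while (@open) {
                my $top = pop @open;
                $out .= "</$top>";
                last if $top eq $name;
            }
        }
        elsif ($token =~ m{^<\s*([A-Za-z][A-Za-z0-9]*)}) {        # opening tag
            my $name = lc $1;
            # Void and self-closed tags are emitted but never stacked.
            push @open, $name unless $VOID{$name} || $token =~ m{/\s*>$};
            $out .= $token;
        }
        else {
            $out .= $token;    # text, comments, doctype, and so on
        }
    }
    # Close whatever is still open at the end of the input.
    $out .= "</$_>" for reverse @open;
    return $out;
}

print balance_tags('<b><i>test</b>'), "\n";    # <b><i>test</i></b>
```

Because missing closers are emitted just before the closer that exposes them, this places each repaired tag in the last position where it still nests properly.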

      . . . except that you also need to account for closing tags that aren't nested properly. For instance, in the following example, using that simple stack approach would give you extra closing tags:

      Try <i><b>this</i></b> on for size.

      Instead of fixing improper nesting, a straight-up stack matching approach would give you this:

      Try <i><b>this</b></i></b> on for size.

      It's also probably best these days to stick to valid XHTML, which means that all tags get closed (for instance, use <hr /> instead of <hr>).

      print substr("Just another Perl hacker", 0, -2);
      - apotheon
      CopyWrite Chad Perrin

        Detecting that error is so trivial that I thought it not worth mentioning.
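For illustration, a check of that kind might look like the sketch below. It only reports closing tags that do not match the tag on top of the stack and repairs nothing; `stray_closers` is a made-up name, the `%VOID` list is partial, and the regexp tokenizing carries the usual caveats about real-world HTML.

```perl
use strict;
use warnings;

# Elements that never take a closing tag (a partial, illustrative list).
my %VOID = map { $_ => 1 } qw(br hr img input meta link);

# stray_closers is a hypothetical helper: it returns the names of closing
# tags that do not match the innermost open tag when they are encountered.
sub stray_closers {
    my ($html) = @_;
    my (@open, @stray);
    while ($html =~ m{<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>}g) {
        my ($slash, $name) = ($1, lc $2);
        if (!$slash) {
            push @open, $name unless $VOID{$name};    # opening tag
        }
        elsif (@open && $open[-1] eq $name) {
            pop @open;                                # properly nested closer
        }
        else {
            push @stray, $name;                       # mis-nested or unmatched
        }
    }
    return @stray;
}

print join(', ', stray_closers('Try <i><b>this</i></b> on for size.')), "\n";
```

What to do with a reported tag (drop it, or close the intervening tags first) is a separate policy decision.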
