Re: Ensuring HTML is "balanced"
by GrandFather (Saint) on Mar 06, 2006 at 21:06 UTC
|
HTML::TreeBuilder is a good answer. It is pretty tolerant of missing close tags and can generate nice HTML output if you ask it nicely. You may also be interested in HTML::Lint, which parses HTML and generates an error report.
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Lint;
my $html = do {local $/; (<DATA>)};
my $lint = HTML::Lint->new (only_types => HTML::Lint::Error::STRUCTURE);
$lint->parse ($html);
$lint->eof ();
print "HTML::Lint report:\n";
print join "\n", map {$_->as_string ()} $lint->errors ();
my $tree = HTML::TreeBuilder->new ();
$tree->parse ($html);
$tree->eof ();
print "\n\nTreeBuilder cleaned up HTML\n";
print $tree->as_HTML ();
__DATA__
<p><b><i>test</b></p>
Prints:
HTML::Lint report:
(1:14) <i> at (1:7) is never closed
(1:18) <body> tag is required
(1:18) <head> tag is required
(1:18) <html> tag is required
(1:18) <title> tag is required
TreeBuilder cleaned up HTML
<html><head></head><body><p><b><i>test</i></b></body></html>
DWIM is Perl's answer to Gödel
|
...So since the cleaned up HTML in fact has an unclosed <p> tag, are we free to assume that the Lint report *and* TreeBuilder handle <p> tags in an amusing manner?
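The missing </p> looks like an output setting rather than a parsing problem. As far as I know, as_HTML by default omits end tags that HTML treats as optional, and </p> is one of them. A minimal sketch, assuming the three-argument form of as_HTML (entities, indent string, hash of optional end tags) behaves as documented for HTML::Element; the variable names are only illustrative:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<p><b><i>test</b></p>');
$tree->eof;

# Third argument: the set of tags whose end tags may be left out.
# An empty hash means "leave none out", so </p> should be written explicitly.
my $full = $tree->as_HTML(undef, undef, {});
print $full;

$tree->delete;    # free the tree once we are done with it
```

The first two arguments are left undefined so the default entity encoding and no indentation are used.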
Re: Ensuring HTML is "balanced"
by webfiend (Vicar) on Mar 07, 2006 at 00:20 UTC
|
If your solution doesn't absolutely, positively have to be in Perl, then maybe a call to HTML Tidy would be the appropriate solution. It takes broken input and does its best to send it back to you as clean HTML. I use it in a lot of my projects.
Another solution I use is to not allow HTML formatting in user input, but maybe you don't want to force your users into learning one of the various HTML shorthand languages. Still, using Text::Textile to generate HTML from user input might be a little easier than making sure your users are always creating correct markup on their own.
Re: Ensuring HTML is "balanced"
by spiritway (Vicar) on Mar 06, 2006 at 20:37 UTC
|
I'm not sure if this is what you want, but, using search terms "HTML tags" I found a couple of possibilities: HTML::TagUtil, and HTML::EasyTags. While they may not do exactly what you want, they may provide information about the tags for you to make the desired corrections to your input.
The problem, of course, is that it's often not possible to know exactly where the tags were intended. For example:
<b>Which <i>exact text was supposed to be italics?</b>
Where does the </i> tag go? In more complex text, it is likely that only the original author knows just where the tags should have been placed - if, in fact, s/he even knows.
|
It's true that knowing where the tag should be closed can be a tricky guesstimation at best in many circumstances. There's a simple way to decide where to put the closing tag when in doubt, though: just stick it in the last possible place to have it nest properly. Thus, in your example, the </i> would be placed just before the </b>, like so:
<b>Which <i>exact text was supposed to be italics?</i></b>
While this may not give you exactly what the original poster intended, it does help to get your code to validate properly.
print substr("Just another Perl hacker", 0, -2);
- apotheon
CopyWrite Chad Perrin
Re: Ensuring HTML is "balanced"
by ambrus (Abbot) on Mar 06, 2006 at 20:48 UTC
|
Re: Ensuring HTML is "balanced"
by rvosa (Curate) on Mar 07, 2006 at 04:27 UTC
|
Re: Ensuring HTML is "balanced"
by insaniac (Friar) on Mar 07, 2006 at 08:56 UTC
|
hm... do you know Text::Balanced (a great module by Damian)?
to ask a question is a moment of shame
to remain ignorant is a lifelong shame
|
Did you even bother to read the question? Text::Balanced parses strings with nested paired delimiters. Apart from the fact that it has “balanced” in its name, it has nothing to do with the OP’s problem.
What baffles me even more is that >20 people upvoted this vacuous suggestion, for whatever reason.
Makeshifts last the longest.
|
I did bother... and IMVHO I think it could help out the OP.
Ok, I didn't post *the* solution, but you probably also know there's always more than one way to solve a problem in Perl. This one isn't maybe the easiest, or fastest, or most elegant one... but I still think it could help.
btw: you can always downvote me if you don't like what I said ;)
pussy ass code to prove (I hate proving) I did bother:
to ask a question is a moment of shame
to remain ignorant is a lifelong shame
Re: Ensuring HTML is "balanced"
by DrHyde (Prior) on Mar 08, 2006 at 09:41 UTC
|
Actually, fixing that is pretty easy. Yes, you keep a stack of opening tags that you're within, and then whenever you find a closing tag which doesn't match the top of the stack, you need to add the right closing tag and then try again - and again, and again, and again, until everything's back in sync. You also need to handle the end of the document correctly so that you automagically close anything left on the stack.
With a little more trickery (but only a little) you can take account of stuff like <HR> and <BR> not needing to be closed, that only certain tags are legal immediately inside others (eg <TR> is legal inside <TABLE> but <HR> isn't) and so on.
Even if you ignore all those special cases, you'll have a good solution.
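A minimal sketch of the stack approach described above, assuming a naive regex tokenizer that would be fooled by a ">" inside a comment or attribute value. The helper name balance_html and the short %VOID list are illustrative only, and the rules about which tags are legal inside which (<TR> inside <TABLE> and so on) are left out entirely:

```perl
use strict;
use warnings;

# Elements that never take a closing tag (a partial, illustrative list).
my %VOID = map { $_ => 1 } qw(br hr img input meta link);

# balance_html is a hypothetical helper name, not part of any module.
sub balance_html {
    my ($html) = @_;
    my @stack;        # names of the elements currently open
    my $out = '';

    # Split into tags and text; the capture group keeps the tags in the list.
    for my $token (split /(<[^>]*>)/, $html) {
        next if $token eq '';
        if ($token =~ m{^<\s*/\s*([A-Za-z][A-Za-z0-9]*)}) {    # closing tag
            my $name = lc $1;
            # Close whatever is on top until the stack is back in sync.
            while (@stack && $stack[-1] ne $name) {
                $out .= '</' . pop(@stack) . '>';
            }
            pop @stack if @stack;
            $out .= "</$name>";
        }
        elsif ($token =~ m{^<\s*([A-Za-z][A-Za-z0-9]*)}) {     # opening tag
            my $name = lc $1;
            # Void and self-closed elements are never pushed.
            push @stack, $name unless $VOID{$name} || $token =~ m{/\s*>$};
            $out .= $token;
        }
        else {
            $out .= $token;    # text, comments, doctype: passed through as-is
        }
    }

    # End of document: close anything still open, innermost first.
    $out .= '</' . pop(@stack) . '>' while @stack;
    return $out;
}

print balance_html('<p><b><i>test</b></p>'), "\n";
print balance_html('<b>one<br>two <i>three'), "\n";
```

Note that this version does nothing clever with a closing tag that has no opener: it closes everything still open and then passes the stray tag through unchanged.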
|
. . . except that you also need to account for closing tags that aren't nested properly. For instance, in the following example, using that simple stack approach would give you extra closing tags:
Try <i><b>this</i></b> on for size.
Instead of fixing improper nesting, a straight-up stack matching approach would give you this:
Try <i><b>this</b></i></b> on for size.
It's also probably best these days to stick to valid XHTML, which means that all tags get closed (for instance, use <hr /> instead of <hr>).
print substr("Just another Perl hacker", 0, -2);
- apotheon
CopyWrite Chad Perrin
|
Detecting that error is so trivial that I thought it not worthy to mention.
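For what it's worth, one way that check might look: before unwinding the stack, confirm that the element being closed is actually open somewhere, and drop the closing tag if it is not. This is only a sketch under the same naive-tokenizer assumption; close_properly is a made-up name, and tags such as <br> get no special treatment here:

```perl
use strict;
use warnings;

# close_properly is a hypothetical name; the tokenizer is a naive regex split.
sub close_properly {
    my ($html) = @_;
    my @stack;        # names of the elements currently open
    my $out = '';

    for my $token (split /(<[^>]*>)/, $html) {
        if ($token =~ m{^</([A-Za-z][A-Za-z0-9]*)}) {
            my $name = lc $1;
            # The check: a closer whose element is not open anywhere is stray.
            next unless grep { $_ eq $name } @stack;
            # Otherwise close inner elements until we reach the matching one.
            while ((my $open = pop @stack) ne $name) {
                $out .= "</$open>";
            }
            $out .= "</$name>";
        }
        else {
            # Opening tags are remembered; text is copied through unchanged.
            push @stack, lc $1 if $token =~ m{^<([A-Za-z][A-Za-z0-9]*)};
            $out .= $token;
        }
    }

    # Close anything still open at the end of the input.
    $out .= '</' . pop(@stack) . '>' while @stack;
    return $out;
}

print close_properly('Try <i><b>this</i></b> on for size.'), "\n";
```

With the check in place, the trailing </b> from the example is discarded instead of being emitted a second time.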