http://qs321.pair.com?node_id=326152

History

For a long time, we've wanted to catch mis-nested HTML tags so that mistakes (or malice) in one node can't interfere with the display of elements that appear after the node.

The first attempt was to move from our own hand-rolled regex to a standard CPAN module that could parse and clean up HTML. This was pretty well shot down by the module being 10 times slower than our regex (plus not being as well suited for dealing with hand-written HTML).

Later, libXML showed up and looked promising. Unfortunately, testing showed that, although it can be configured to be forgiving of broken HTML and try to correct it, even in that mode it is quite easy to get it to die (with mis-nested HTML).

After a few years of thinking about the problem very infrequently, the subject was brought up again and I suddenly felt that I might have a handle on how to do a pretty good job of solving the problem rather simply.

I threw some code I came up with (but didn't even try to compile) on tye's scratchpad and knew I'd come back to it at some point in the distant future.

The other day, I couldn't access the code I wanted to work on so I entertained myself with this old code instead.

Now testing

After several rounds of testing, moving closer and closer to the PerlMonks' "production" environment, I've now made the code available to be used on PerlMonks (as of Monday afternoon, California time).

I encourage you to go to user display settings and turn on the 'enforce proper nesting of HTML' option. This option will go away (becoming mandatory) when the feature has been tested enough.

In addition, you can append ;htmlnest=1 to any PerlMonks URL to enable the feature temporarily. Note that doing so (if you haven't also enabled the option in user display settings) will make visible the recent (ugly) hack that prevents unclosed tags in the chatterbox from running amok. This side effect is partly due to laziness and partly to provide an easy-to-find example where you can see the feature have an effect.

If you find a problem, reply in this thread.

If you have HTML nesting enforced (by either of the above methods), then you can add ;htmlerror=1 to a PerlMonks URL to have missing closing tags displayed in grey (with span class="htmlerror"). This will probably be enabled when previewing (with an option to turn it off after the first preview).

Details

The proper nesting of HTML tags is enforced via the following rules and the exceptions that follow them.

The HTML you type in is scanned from beginning to end. When an opening tag (or 'empty' tag like BR, HR, or IMG) is encountered, if it isn't on the list of approved tags (Perl Monks Approved HTML tags or PerlMonks Approved Chatter HTML Tags), then it is encoded into HTML entities so that it will get displayed literally (this part isn't new -- see More HTML escaping).
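As a rough illustration of the step above, here is a minimal Python sketch (the real PerlMonks code is Perl and is not shown here). The APPROVED set is a hypothetical, abbreviated stand-in for the site's approved-tag lists, and escape_unapproved is an illustrative name. It only looks at opening and empty tags; closing tags are handled by the nesting rules further down.

```python
import html
import re

# Hypothetical, abbreviated allow-list; the real lists live on the site.
APPROVED = {"a", "b", "i", "p", "br", "hr", "img", "ul", "li"}

# Matches an opening or empty tag; no literal < or > allowed inside the tag.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b([^<>]*)>")

def escape_unapproved(text):
    """Escape opening/empty tags whose name is not on the allow-list."""
    def fix(match):
        if match.group(1).lower() in APPROVED:
            return match.group(0)          # approved: keep the tag as typed
        # Not approved: encode as entities so it is displayed literally.
        return html.escape(match.group(0), quote=False)
    return TAG_RE.sub(fix, text)
```

For example, escape_unapproved("<b>bold</b> <blink>x") leaves the B tags alone and turns the opening BLINK tag into &lt;blink&gt;.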

If the opening/empty tag is approved, then any attributes present are filtered: unapproved attributes are silently thrown away, unquoted attribute values have quotes added, any square brackets are converted to HTML entities, a trailing " /" is added (if missing) for empty tags, spacing is normalized, duplicate attributes are removed, and the tag name and attribute names are all converted to lowercase. Note that, regardless of any HTML standards, PerlMonks does not let you include a literal < nor > inside of HTML tags. (Most of this isn't new.)
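The attribute filtering described above could be sketched roughly as follows in Python. This is an illustration under assumptions, not the site's code: APPROVED_ATTRS and EMPTY_TAGS are hypothetical and abbreviated, normalize_tag is an illustrative name, and keeping the first of several duplicate attributes is my guess (the text only says duplicates are removed).

```python
import re

# Hypothetical per-tag attribute allow-list and empty-tag set.
APPROVED_ATTRS = {"a": {"href", "title"}, "img": {"src", "alt"}, "br": set()}
EMPTY_TAGS = {"br", "hr", "img"}

# name, optionally followed by = and a double-quoted, single-quoted,
# or unquoted value.
ATTR_RE = re.compile(r"""([A-Za-z][-A-Za-z0-9]*)\s*(?:=\s*("[^"]*"|'[^']*'|[^\s"']+))?""")

def normalize_tag(name, raw_attrs):
    """Rebuild an approved opening/empty tag from its name and attribute text."""
    name = name.lower()
    allowed = APPROVED_ATTRS.get(name, set())
    seen = set()
    parts = [name]
    for match in ATTR_RE.finditer(raw_attrs):
        attr = match.group(1).lower()
        if attr not in allowed or attr in seen:
            continue                       # unapproved or duplicate: drop it
        seen.add(attr)
        value = match.group(2) or ""       # valueless attribute gets "" (assumption)
        if value[:1] in ('"', "'"):
            value = value[1:-1]            # strip the original quotes
        value = (value.replace('"', "&quot;")
                      .replace("[", "&#91;")
                      .replace("]", "&#93;"))
        parts.append(f'{attr}="{value}"')
    if name in EMPTY_TAGS:
        parts.append("/")                  # gives the trailing " /"
    return "<" + " ".join(parts) + ">"
```

With these assumed lists, normalize_tag("IMG", 'SRC="a.png"') gives <img src="a.png" /> and an onclick attribute on an A tag simply disappears.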

Opening (non-empty) tags are tracked to ensure they get closed in the reverse order.

When a closing tag is found, if that tag has never been opened, then the tag is converted to HTML entities so that it will appear literally. If it is not a block-level tag and was opened in a previous block (not in the current block) then it is also escaped so it will appear literally (a misplaced non-block-level closing tag won't force any blocks to be closed).

[ Block-level (or block-like) tags (versus in-line or character-level tags) are defined by the HTML standard. For PerlMonks HTML filtering, the block-level tags are: H1..H6, DL, UL, OL, PRE, P, DIV, BLOCKQUOTE, FORM, and TABLE. ]

Otherwise, the closing tag is kept but is preceded by whatever closing tags are needed to close any tags that were opened after this one.
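The closing-tag rules above can be sketched with a simple stack of open tag names. This Python sketch is my reading of the rules, not the actual Perl code: close_tag and open_tags are illustrative names, "never been opened" is taken to mean "not currently open", and "opened in a previous block" is taken to mean "a block-level tag was opened after it and is still open".

```python
import html

# Block-level tags as listed in the text above.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "dl", "ul", "ol",
              "pre", "p", "div", "blockquote", "form", "table"}

def close_tag(name, open_tags):
    """Return the text to emit for </name>, updating the open_tags stack.

    open_tags is a list of currently open tag names, innermost last.
    """
    name = name.lower()
    literal = html.escape(f"</{name}>", quote=False)
    if name not in open_tags:
        return literal                     # not open: show it literally
    # Index of the innermost open instance of this tag.
    idx = len(open_tags) - 1 - open_tags[::-1].index(name)
    opened_after = open_tags[idx + 1:]
    if name not in BLOCK_TAGS and any(t in BLOCK_TAGS for t in opened_after):
        return literal                     # in-line tag opened outside this block
    # Close everything opened after this tag, innermost first, then the tag.
    out = "".join(f"</{t}>" for t in reversed(opened_after)) + f"</{name}>"
    del open_tags[idx:]
    return out
```

So with B then I open, a closing B yields </i></b>; with B then P open, a closing B is escaped and the P stays open.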

A few tags are designated as non-nesting. If you open one of these tags twice inside the same block, then instead of nesting, the first tag is closed (along with any nested tags) before the second tag is opened. For PerlMonks HTML filtering, the non-nesting tags are: LI, TR, TH, TD, and P. Note that you can nest these tags by enclosing the inner one inside of a block tag.
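Here is how the non-nesting rule might look in a Python sketch, using the same kind of open-tag stack. Again this is an interpretation rather than the site's code: open_tag is an illustrative name, and since P appears in both lists, the sketch checks for a matching open tag before it checks for a block boundary.

```python
# Tag lists as given in the text above.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "dl", "ul", "ol",
              "pre", "p", "div", "blockquote", "form", "table"}
NON_NESTING = {"li", "tr", "th", "td", "p"}

def open_tag(name, open_tags):
    """Return the text to emit for <name>, updating the open_tags stack."""
    name = name.lower()
    out = ""
    if name in NON_NESTING:
        # Scan from the innermost open tag outward, staying in this block.
        for idx in range(len(open_tags) - 1, -1, -1):
            tag = open_tags[idx]
            if tag == name:
                # Same tag already open in this block: close it and
                # everything nested inside it before opening the new one.
                out = "".join(f"</{t}>" for t in reversed(open_tags[idx:]))
                del open_tags[idx:]
                break
            if tag in BLOCK_TAGS:
                break                      # block boundary: nesting is allowed
    open_tags.append(name)
    return out + f"<{name}>"
```

With UL, LI, B open, a new LI produces </b></li><li>. With UL, LI, UL open, the inner UL is a block boundary, so the new LI simply nests.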

When we reach the end of your typing, we close any tags you left open.

Any closing tags that had to be inserted will also be displayed if ;htmlerror=1 was present in the PerlMonks URL.
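Closing whatever is left open at the end is just a matter of unwinding the stack. In the Python sketch below, close_remaining and show_errors are illustrative names, and the exact markup used for the ;htmlerror=1 display is an assumption on my part: the text only says the inserted tags are shown in grey with span class="htmlerror", so the sketch shows each inserted tag literally inside such a span, just before the real closing tag.

```python
import html

def close_remaining(open_tags, show_errors=False):
    """Close every tag still open, innermost first, emptying the stack."""
    out = []
    while open_tags:
        tag = f"</{open_tags.pop()}>"
        if show_errors:
            # Assumed markup: the inserted tag, escaped so it is visible.
            shown = html.escape(tag, quote=False)
            out.append(f'<span class="htmlerror">{shown}</span>')
        out.append(tag)
    return "".join(out)
```

For a stack of UL, LI, B this returns </b></li></ul>.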

One other way that PerlMonks intentionally departs from standard HTML is how it handles comments. PerlMonks HTML comments simply start with <!-- and simply end with -->. Any occurrences of "--" inside the comment get changed to "- -" so that the result is always a standards-compliant HTML comment. Using an HTML comment like <! -- foo -- > will cause the < to be displayed literally, since it isn't part of a PerlMonks approved HTML tag.
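The comment handling can be sketched in a few lines of Python. This is an illustration, not the actual code: fix_comments is an illustrative name, and the extra space added when the comment body ends in "-" is my own guard (so the body cannot run into the closing -->), not something stated above.

```python
import re

# A comment starts at <!-- and ends at the first --> after it.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def fix_comments(text):
    """Rewrite "--" inside comment bodies as "- -"."""
    def fix(match):
        body = match.group(1)
        # Repeat so that runs like "---" end up as "- - -".
        while "--" in body:
            body = body.replace("--", "- -")
        if body.endswith("-"):
            body += " "                    # my own guard, see lead-in
        return "<!--" + body + "-->"
    return COMMENT_RE.sub(fix, text)
```

For example, fix_comments("<!-- a -- b -->") returns "<!-- a - - b -->", and text outside comments is left untouched.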

I'll include some examples in a reply (inside a READMORE so they won't be obnoxious to monks who don't have 'htmlnest' enabled).

- tye