http://qs321.pair.com?node_id=192439

trs80 has asked for the wisdom of the Perl Monks concerning the following question:

I have been working on a project for the last few months, and one part of it parses HTML pages and creates online forms for editing the content of particular parts of those pages. My problem now seems to lie in the fact that browsers are far more forgiving than my tool of choice, HTML::TreeBuilder, because HTML like this:
<html> <head> </head> <body> <table> <tr> <center> <td> </td> </center> </tr> </table> </body> <html>
Turns into:
<html><head> </head><body> <table> <tr> <td><center> </center></td> <td> </td> </tr> </table> </body> </html>
The stray <center> tag ends up wrapped in an extra <td> tag.
While the browser is able to handle the original BAD HTML, the resulting "corrected" HTML doesn't display correctly. I had been running the pages through tidy first, but tidy seems to handle many of these cases poorly as well, and the result was even worse formatting.

This is just one example of the type of HTML I need to deal with; the sky's the limit for what other ill-formed documents await me. There is a difference between versions of HTML::TreeBuilder as well: I was running an older copy and upgraded today to the latest version to make sure this wasn't a bug that had already been fixed. The results vary between the older version and the latest, but neither keeps the bad HTML as it was.

Here is a sample script:
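use strict;
use warnings;
use HTML::TreeBuilder;

# Read the sample document from the DATA section below.
my $html;
while (<DATA>) {
    $html .= $_;
}

# Parse the document and print it back out as HTML::TreeBuilder sees it.
my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();    # signal end of input so the tree is finalized

print $tree->as_HTML();

__DATA__
<html> <head> </head> <body> <table> <tr> <center> <td> </td> </center> </tr> </table> </body> <html>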