trs80 has asked for the wisdom of the Perl Monks concerning the following question:
I have been working on a project for the last few months, and one part of it parses HTML pages and creates online forms that allow editing the content of particular parts of those pages. My problem now seems to lie in the fact that browsers are far more forgiving than my tool of choice, HTML::TreeBuilder, because HTML like this:

<html>
<head> </head>
<body>
<table>
<tr>
<center> <td> </td> </center>
</tr>
</table>
</body>
<html>

Turns into:

<html><head> </head><body>
<table>
<tr>
<td><center> </center></td>
<td> </td>
</tr>
</table>
</body>
</html>

The stray <center> tag gets turned into an extra <td> tag.
While the browser is able to handle the first BAD html, the resulting "corrected" html doesn't display correctly. I had been running the pages through tidy first, but tidy seems to handle many of these cases poorly as well, and it resulted in even worse formatting.
This is just one example of the type of HTML I need to deal with; the sky's the limit for what other ill-formed documents await me. There is a difference between versions of HTML::TreeBuilder as well. I was running an older copy and upgraded today to the latest version to make sure it wasn't a bug that had been fixed. The results vary from older versions to the latest, but none of them keep the bad html.
Here is a sample script:
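use strict;
use HTML::TreeBuilder;    # provides HTML::TreeBuilder->new

# Read the sample document from the DATA section below.
my $html;
while (<DATA>) {
    $html .= $_;
}

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();             # tell the parser the input is complete
print $tree->as_HTML();

__DATA__
<html>
<head> </head>
<body>
<table>
<tr>
<center> <td> </td> </center>
</tr>
</table>
</body>
<html>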
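An untested sketch for anyone hitting the same problem: the HTML::TreeBuilder documentation describes an implicit_tags attribute which, when set to false, is said to give a parse tree that just reflects the text as it stands. Whether that actually leaves the stray <center> where it is, in every version of the module, is an assumption and has not been verified here. The helper name parse_without_implicit_tags is hypothetical; the input is the same bad HTML as in the sample script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the original post): parse $html with
# implicit-tag inference switched off and return the re-serialised markup.
sub parse_without_implicit_tags {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new();
    $tree->implicit_tags(0);    # assumption: stops the parser inferring missing tags
    $tree->parse($html);
    $tree->eof();               # signal that the input is complete
    my $out = $tree->as_HTML();
    $tree->delete();            # free the tree explicitly
    return $out;
}

# Same ill-formed document as in the sample script.
my $bad_html = <<'END_HTML';
<html>
<head> </head>
<body>
<table>
<tr>
<center> <td> </td> </center>
</tr>
</table>
</body>
<html>
END_HTML

print parse_without_implicit_tags($bad_html), "\n";
```

Comparing this output with the output of the default settings should show whether the attribute is enough on its own, or whether a token-level approach is needed instead.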
Replies are listed 'Best First'.
Re: Keeping bad HTML bad
by fruiture (Curate) on Aug 23, 2002 at 20:32 UTC
by trs80 (Priest) on Aug 23, 2002 at 20:52 UTC
by Abstraction (Friar) on Aug 23, 2002 at 21:10 UTC
by trs80 (Priest) on Aug 23, 2002 at 21:26 UTC
Re: Keeping bad HTML bad
by ducky (Scribe) on Aug 23, 2002 at 23:40 UTC
Re: Keeping bad HTML bad
by Anonymous Monk on Aug 24, 2002 at 01:21 UTC
Re: Keeping bad HTML bad
by trs80 (Priest) on Aug 24, 2002 at 16:52 UTC
Re: Keeping bad HTML bad
by adrianh (Chancellor) on Aug 24, 2002 at 20:35 UTC
by trs80 (Priest) on Aug 24, 2002 at 21:37 UTC