http://qs321.pair.com?node_id=192439

trs80 has asked for the wisdom of the Perl Monks concerning the following question:

I have been working on a project for the last few months, and one part of it parses HTML pages and creates online forms to allow editing the content of particular parts of said HTML pages. My problem now seems to lie in the fact that browsers are far more forgiving than my tool of choice, HTML::TreeBuilder, because HTML like this:
<html> <head> </head> <body> <table> <tr> <center> <td> </td> </center> </tr> </table> </body> <html>
Turns into:
<html><head> </head><body> <table> <tr> <td><center> </center></td> <td> </td> </tr> </table> </body> </html>
The stray <center> tag gets turned into an extra <td> tag.
While the browser is able to handle the first BAD HTML, the resulting "corrected" HTML doesn't display correctly. I had been running the pages through tidy first, but tidy seems to handle many of these cases poorly as well, and it resulted in even worse formatting.

This is just one example of the type of HTML I need to deal with; the sky's the limit for what other ill-formed documents await me. There is a difference between versions of HTML::TreeBuilder as well: I was running an older copy and upgraded today to the latest version to make sure it wasn't a bug that had already been fixed. The results vary from the older versions to the latest, but they still don't preserve the bad HTML.

Here is a sample script:

use strict;
use HTML::TreeBuilder;

my $html;
while (<DATA>) { $html .= $_; }

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
print $tree->as_HTML();

__DATA__
<html> <head> </head> <body> <table> <tr> <center> <td> </td> </center> </tr> </table> </body> <html>

Replies are listed 'Best First'.
Re: Keeping bad HTML bad
by fruiture (Curate) on Aug 23, 2002 at 20:32 UTC

    First of all: check the HTML in your Post ;)

    Secondly: simple. You cannot handle something as HTML which isn't HTML. Call me stubborn, but if someone enters such wrong (and deprecated) stuff, he'll have to live with the consequences.

    --
    http://fruiture.de
      This isn't a matter of choice for me. I am getting badly formatted HTML and I have to do certain tasks with it. The code may come from "popular" editors, and if it displays in a browser when I get it, it has to display the same way when it leaves. What specifically are you referring to in the HTML in my post, so I might better explain whether it is the problem or simply my mistake? If you are referring to the misplaced center tag, that is unfortunately exactly as it is in one of the documents I am working with.
        if it displays in a browser when I get it, it has to display the same way when it leaves

        Then you don't want to use HTML::TreeBuilder on it, if it's not interpreting your bad HTML correctly. Maybe if you expanded on what sections of the HTML you are allowing the user to change, we can help with a solution.
Re: Keeping bad HTML bad
by ducky (Scribe) on Aug 23, 2002 at 23:40 UTC

    After reading your replies to further questions, it sounds like you don't really need something that truly understands HTML to the fullest, but rather a parser that groks the concepts of text, HTML tags and their attributes, and HTML comments.

    My suggestion would be to write something that can pick apart those three things, make the small edits you need, and reassemble the file, malformedness and all.

    If you proceed down the path that the browsers did and try to guess a document's meaning (via your own code or a module), I believe you'll end up fixing HTML by hand far too often, either on the way in or the way out, and much sadness will ensue.

    To offer at least something and not leave you with "Oh ya, that sucks. Good luck!", here's a bit of code that might help. It's inefficient, but OK for moderate-sized data. (I just happened to run across this yesterday in my playground/lets-see-if-I-can-do-X perl script dir):

    sub parse {
        my $source = shift;
        my @output;

        # Grab any text up to the next '<', then consume the '<' itself.
        while ( $source =~ s/^([^<]*)<//s ) {
            push @output, [ 'text', $1 ] if $1 ne '';

            # Pull off the tag name, up to the first whitespace or '>'.
            $source =~ s/^([^\s>]*)([\s>])//s;
            my $tag = $1;

            if ( $tag eq '!--' ) {
                # An HTML comment: capture everything up to the closing '-->'.
                $source =~ s/^(.*?)-->//s;
                push @output, [ 'comment', $1 ];
            }
            else {
                my $param = '';
                if ( $2 ne '>' ) {
                    # Collect attributes, honoring quoted values so that a
                    # '>' inside quotes doesn't end the tag early.
                    while ( $source =~ s/^([^'">]*)(["'>])//s ) {
                        $param .= $1;
                        last if $2 eq '>';
                        my $quote = $2;
                        $source =~ s/^([^$quote]*)$quote//;
                        $param .= $quote . $1 . $quote;
                    }
                }
                push @output, [ 'tag', $tag, $param ];
            }
        }

        # Keep any trailing text after the last tag.
        push @output, [ 'text', $source ] if $source ne '';
        return @output;
    }
    Keep in mind, it's largely untested; it was written like 4 years ago, mostly as an exercise.

    What it does is take a whole file and return an array of array refs. Each array ref contains 2 or 3 elements: what type of data it was (text, tag, comment), the data itself (minus the actual markup in the case of tags), and, if the type is a tag, a 3rd element holding the tag params.
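
    For what it's worth, reassembly from that structure is straightforward. Here's a minimal round-trip sketch using the parse() above, assuming $html holds the raw document; note the whitespace that originally followed a tag name isn't stored, so the rebuild is only approximate there:

    my @chunks  = parse($html);
    my $rebuilt = '';
    for my $chunk (@chunks) {
        my ( $type, $data, $param ) = @$chunk;
        if    ( $type eq 'text' )    { $rebuilt .= $data }
        elsif ( $type eq 'comment' ) { $rebuilt .= "<!--$data-->" }
        else {
            # A tag, possibly with attributes.
            $rebuilt .= $param ne '' ? "<$data $param>" : "<$data>";
        }
    }
    print $rebuilt;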

    HTH

    -Ducky

Re: Keeping bad HTML bad
by Anonymous Monk on Aug 24, 2002 at 01:21 UTC
    *sigh*

    Why don't you just forget about HTML::TreeBuilder and parse your stuff with HTML::Parser, doing the right thing when you encounter that type of illegal HTML?
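
    For example, HTML::Parser's event API can stream a document through untouched, letting you intervene only on the constructs you care about. A rough pass-through sketch, reading the page from STDIN for illustration (the 'text' argspec hands each handler the original source of the event):

    use strict;
    use HTML::Parser;

    my $bad_html = do { local $/; <STDIN> };   # or however the page arrives
    my $out      = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Echo every event's original source through verbatim by default.
        default_h => [ sub { $out .= shift }, 'text' ],
        # Hook start tags separately when you need to inspect or fix one.
        start_h   => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($bad_html);
    $p->eof;
    print $out;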

    Better yet, why don't you just correct it? I have no idea what "browsers" do with that kind of baloney, but I see no reason for you to keep it alive.

    If the appropriate translation is <td align="center">, then just make it.

    As far as I know, most if not all browsers just ignore the stray CENTER tags.

    You cannot expect any tool to automagically do, with INSANELY STRAY AND ILLEGAL input, any kind of ILLOGICAL thing you can think up.

Re: Keeping bad HTML bad
by trs80 (Priest) on Aug 24, 2002 at 16:52 UTC
    Shame on me. RTFM.
    I went to all the trouble of running through a debugger to see where it was doing the "correction" on the HTML, when the answer was in the documentation all along, had I known better what to look for.

    At line 291 it does a check for valid tags within other tags if the '_implicit_tags' value is set. This is covered in the documentation:
    $root->implicit_tags(value)
        Setting this attribute to true will instruct the parser to try to
        deduce implicit elements and implicit end tags. If it is false you
        get a parse tree that just reflects the text as it stands, which is
        unlikely to be useful for anything but quick and dirty parsing. (In
        fact, I'd be curious to hear from anyone who finds it useful to have
        implicit_tags set to false.) Default is true. Implicit elements have
        the implicit() attribute set.


    This seems to correct the problem, or should I say, allow the problem to persist. Thanks for all the responses.
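
    In other words, the whole fix is one method call before parsing. A minimal sketch, reading the page from STDIN for illustration:

    use strict;
    use HTML::TreeBuilder;

    my $html = do { local $/; <STDIN> };   # or however the page arrives

    my $tree = HTML::TreeBuilder->new();
    # Keep the tree faithful to the source instead of deducing
    # implicit elements and implicit end tags.
    $tree->implicit_tags(0);

    $tree->parse($html);
    $tree->eof;
    print $tree->as_HTML();
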
Re: Keeping bad HTML bad
by adrianh (Chancellor) on Aug 24, 2002 at 20:35 UTC

    You're going to have problems with HTML parsers since, as everybody has pointed out, it's not really HTML.

    If you are in a position where you cannot force who/whatever is producing the broken HTML to stick to standards, the easiest alternative is to treat it as a string or a sequence of tags rather than as a tree structure.

    I had a similar problem several years back, which I resolved by simply adding special comments around the content that the user had to edit. Something like:

    some stuff
    <!-- start editable foo/bar -->
    some more stuff
    <!-- end editable foo/bar -->
    even more stuff

    The "editable" stuff could then be extracted with some simple regexes.

    Without some more info on what kind of transformations you're trying to apply to the source it's a little difficult to give more specific advice. Can you give us more of an idea of what you're trying to do?

      This is a good suggestion, but in my case I am very limited in what I can do for the user as far as the HTML goes, and all comments are removed (and are to be removed, by client request) from all pages processed. I go into some specifics in one of my earlier replies, but to rephrase and recap what I am doing:

      • Retrieve remote document via HTTP ( LWP::UserAgent, HTTP::Request )
      • Parse document for local storage and confirm that its format isn't horribly disgusting ( HTML::TreeBuilder )
      • Allow editing of title tag, meta tags, anchor tag title attribute, and img tag alt attribute.
      The forms for the editing are created by relying on where each tag is located inside of the element array created by HTML::TreeBuilder. That is, if a person selects alt tags as the way they want to edit, each img tag is located using the look_down method in an array context:
      my @img   = $tree->look_down('_tag', 'img');
      my $count = 0;
      my $form  = '';
      foreach my $element (@img) {
          # make a form element:
          # $form .= call to CGI function, name = "img-$count"
          $count++;
      }
      return $form;
      Then when they submit the form, the $count is referenced and the appropriate img tag's alt content is replaced.
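
      The write-back side would then look roughly like this, assuming $n is the index recovered from the submitted "img-$n" field name and $new_alt is a hypothetical variable holding the submitted value:

      my @img = $tree->look_down('_tag', 'img');
      # Same ordering as when the form was built, so the index lines up.
      $img[$n]->attr( 'alt', $new_alt );
      print $tree->as_HTML();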

      But this is all moot, since the issue was, and is, that HTML::TreeBuilder is "supposed" to handle bad HTML: it uses HTML::Parser, and one of the goals of HTML::Parser is to work with documents that are really out there, so the example given should work with HTML::TreeBuilder. And in fact it does; part of my problem was not turning off implicit_tags, as one of my other replies above states. The implicit_tags attribute is unique to the HTML::TreeBuilder module, and it attempts to correct badly formatted HTML, which 98% of the time is most likely a good thing, but at least the author designed in the ability to turn off that behavior for the 2% of the time it isn't.

      I have tested my ideas and have confirmed that setting that flag allows for the conditions I need, but it results in a different anomaly, which I have contacted the author of the module about.