Re: Keeping bad HTML bad

You're going to have problems with HTML parsers - since, as everybody has pointed out, it's not really HTML.

If you cannot force whoever (or whatever) is producing the broken HTML to stick to the standards, the easiest alternative is to treat it as a string or a sequence of tags rather than as a tree structure.

I had a similar problem several years back, which I resolved by simply adding special comments around the content that the user had to edit. Something like:


some stuff
<!-- start editable foo/bar -->
some more stuff
<!-- end editable foo/bar -->
even more stuff

The "editable" stuff could then be extracted with some simple regexes.
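
As a rough sketch of what those regexes might look like (the helper names extract_editable and replace_editable are made up for illustration, and the marker format is assumed to match the example above exactly):

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of region name => editable text.
sub extract_editable {
    my ($html) = @_;
    my %region;
    # \1 makes sure the end marker carries the same name as the start marker.
    while ($html =~ m{<!--\s*start editable\s+(\S+)\s*-->(.*?)<!--\s*end editable\s+\1\s*-->}gs) {
        $region{$1} = $2;
    }
    return %region;
}

# Hypothetical helper: swaps new text into one named region and
# leaves everything outside the markers byte-for-byte untouched.
sub replace_editable {
    my ($html, $name, $new) = @_;
    $html =~ s{(<!--\s*start editable\s+\Q$name\E\s*-->).*?(<!--\s*end editable\s+\Q$name\E\s*-->)}{$1$new$2}s;
    return $html;
}

my $doc = <<'HTML';
some stuff
<!-- start editable foo/bar -->
some more stuff
<!-- end editable foo/bar -->
even more stuff
HTML

my %region  = extract_editable($doc);
my $updated = replace_editable($doc, 'foo/bar', "\nnew text\n");
print $updated;
```

Because the document is only ever handled as a string, whatever broken markup sits outside the markers comes back out exactly as it went in.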

Without some more info on what kind of transformations you're trying to apply to the source it's a little difficult to give more specific advice. Can you give us more of an idea of what you're trying to do?


Re: Re: Keeping bad HTML bad
by trs80 (Priest) on Aug 24, 2002 at 21:37 UTC

Retrieve remote document via HTTP ( LWP::UserAgent, HTTP::Request )
Parse document for local storage and confirm that its format isn't horribly disgusting ( HTML::TreeBuilder )
Allow editing of title tag, meta tags, anchor tag title attribute, and img tag alt attribute.

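use CGI qw(textfield);

my @img   = $tree->look_down('_tag', 'img');
my $count = 0;
my $form  = '';
foreach my $element (@img) {
    # make a form element named "img-$count"; pre-filling it with the
    # current alt text is an assumption about what the form should show
    my $alt = $element->attr('alt');
    $form .= textfield(-name => "img-$count", -default => defined $alt ? $alt : '');
    $count++;
}
return $form;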
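
For the other half of the round trip, here is a minimal sketch of writing submitted values back into the tree. The helper name apply_alt_edits and the shape of the edits hash (keyed by the same "img-$count" field names) are assumptions for illustration, not something from the original code:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: $edits maps form field names ("img-0", "img-1", ...)
# to the new alt text. Fields missing from the hash are left alone.
sub apply_alt_edits {
    my ($tree, $edits) = @_;
    my $count = 0;
    foreach my $element ($tree->look_down(_tag => 'img')) {
        my $new = $edits->{"img-$count"};
        $element->attr('alt', $new) if defined $new;
        $count++;
    }
    return $tree;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>t</title></head><body>'
  . '<img src="a.png" alt="old"><img src="b.png"></body></html>'
);
apply_alt_edits($tree, { 'img-1' => 'second image' });

# Collect the alt text of every image, in document order.
my @alts = map { $_->attr('alt') } $tree->look_down(_tag => 'img');
$tree->delete;
```

Note that serializing the tree afterwards (for example with as_HTML) writes out the parser's cleaned-up view of the document, so any broken markup in the source will not survive the round trip.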


