Re: clean html tags

BEWARE: THIN ICE!

Let us pretend, for this discussion, that I regard the sample in the OP as something approximating "clean html" (ah, shucks; just say it: IMO, YMMV, that IS NOT clean; that's flat-out ugly!). OK, back to pretending.

Suppose you have a partially "clean html" file to deal with... say something that contains a line not too different from yours...

<B>TEXT & MORE TEXT</B><BR>FOO &nbsp; BAR

where the originator, for whatever reason, knew that one can force a browser to render multiple consecutive spaces by inserting a non-breaking-space charentity, &nbsp;, between each pair of ordinary spaces (0x20s).

Simply converting each ampersand to its charentity will not produce the outcome you want; rather, you'll get something like this:

<B>TEXT &amp; MORE TEXT</B><BR>FOO &amp;nbsp; BAR

which will render as:

TEXT & MORE TEXT
FOO &nbsp; BAR
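
One way around the problem described above is to escape only those ampersands that do not already begin a character entity. What follows is a minimal sketch, not a full solution: the sub name escape_bare_ampersands is made up for illustration, and the pattern only checks that something has the shape of an entity (a name, a decimal number or a hex number, followed by a semicolon), not that the name is one HTML actually defines.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "&" into "&amp;" unless it already starts
# something shaped like a character entity (&name; &#160; &#xA0;).
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $html;
}

print escape_bare_ampersands('<B>TEXT & MORE TEXT</B><BR>FOO &nbsp; BAR'), "\n";
# prints: <B>TEXT &amp; MORE TEXT</B><BR>FOO &nbsp; BAR
```

Because of the negative lookahead, running the sub a second time over its own output changes nothing. Text such as "AT&T;" would still fool it, which is one more reason to prefer a real parser or Tidy for anything beyond a quick fix.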

Or, suppose the incoming html is badly formed (mis-nested, for example): you're still going to have to rely on the Mark I eyeball or one of the packages discussed elsewhere in this thread to "clean" that, unless the definition of "clean html" is restricted to enforcing use of character entities.
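
To make "mis-nested" concrete, here is a rough sketch of a tag-stack check in plain Perl. It is an illustration only, and no substitute for Tidy or a real parser: the sub name nesting_problems is invented for this example, the regex will be confused by a ">" inside an attribute value, and because HTML 4.01 allows many end tags to be omitted (</p>, </li> and so on), it will complain about some perfectly valid documents.

```perl
use strict;
use warnings;

# Elements that never take an end tag in HTML 4.01.
my %VOID = map { $_ => 1 } qw(area base br col hr img input link meta param);

# Hypothetical helper: return a list of complaints about tag nesting.
sub nesting_problems {
    my ($html) = @_;
    my (@stack, @problems);

    # Drop comments so that markup inside them is not counted.
    $html =~ s/<!--.*?-->//gs;

    while ( $html =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>}g ) {
        my ($closing, $name) = ($1, lc $2);
        next if $VOID{$name};
        if ( !$closing ) {
            push @stack, $name;               # opening tag
        }
        elsif ( @stack && $stack[-1] eq $name ) {
            pop @stack;                       # properly matched close
        }
        else {
            my $open = @stack ? "<$stack[-1]>" : 'nothing';
            push @problems, "</$name> closes while $open is open";
        }
    }
    push @problems, "<$_> is never closed" for @stack;
    return @problems;
}

print "$_\n" for nesting_problems('<b><i>oops</b></i>');
```

For the sample input, the first complaint is that </b> closes while <i> is open. Entity escaping alone would never notice that.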

And, finally (by way of illustrating why my opening jape is not mere ill-temper), while the following is open to numerous criticisms (failure to use the "strict" doctype; loading up the keywords meta; style definitions included in-page rather than linked; etc.), IT IS valid, i.e. "clean", html per the W3C's 4.01 standard:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>Clean html </title>
<meta name="description" content="clean code for illustration">
<meta name="keywords" content="html, clean, 'character entities', charentity">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css">
<!--
.b {
font-weight: bold;
}
-->
</style>
</head>
<body>

<p> Re: character entities (charentity) and how to clean up html</p>

<p><span class="b">TEXT &amp; MORE TEXT</span>
<br>
FOO &nbsp; BAR
</p>

</body>
</html>

FWIW, and without deprecating the desire to do this with Perl, you might consider the standalone version of Tidy for html or a commercial validator.

Re^2: clean html tags by wfsp (Abbot) on Jan 25, 2007 at 20:01 UTC
"...you might consider the standalone version of Tidy for html..."

I agree. It's very nifty indeed, easy to use and highly configurable. Something as simple as:

my $in  = 'bad.html';
my $out = 'tidied.html';
my $err = 'tidy.err';
my $cnf = 'tidy.cnf';

system(
    'tidy.exe', '-asxml',
    -config => $cnf,
    -file   => $err,
    -output => $out,
    $in,
);

can be easily adapted to process a list of files or even a local copy of a web site. The config file can be tweaked to be severe or lenient to taste. You can easily interrogate all the error files to get a good picture of how bad the html is (and there is a lot of it about!).

Again, I agree. Why bother to go to a lot of trouble when there is a very clever bit of kit available?
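
Adapting the call above to a list of files might look something like the sketch below. Several things here are assumptions, not part of the original post: the sub names tidy_command and tidy_files, the executable being reachable as plain "tidy" on the PATH, and the naming scheme for the output and error files. The exit codes are Tidy's usual ones: 0 for a clean run, 1 for warnings, 2 for errors.

```perl
use strict;
use warnings;

# Build the argument list for one input file. The derived file names
# (.tidied.html, .err) are a convention assumed for this sketch.
sub tidy_command {
    my ($in, $cnf) = @_;
    (my $base = $in) =~ s/\.html?$//i;
    return (
        'tidy', '-asxml',
        '-config' => $cnf,
        '-file'   => "$base.err",
        '-output' => "$base.tidied.html",
        $in,
    );
}

# Run Tidy over each file and record its exit status
# (undef if the program could not be started at all).
sub tidy_files {
    my ($cnf, @files) = @_;
    my %status;
    for my $in (@files) {
        my $rc = system( tidy_command($in, $cnf) );
        $status{$in} = $rc == -1 ? undef : $rc >> 8;
    }
    return %status;
}
```

A call such as my %status = tidy_files('tidy.cnf', 'bad.html', 'worse.html'); then gives a quick per-file picture of which pages need attention, before anyone opens the individual .err files.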
Re^2: clean html tags by InfiniteLoop (Hermit) on Jan 25, 2007 at 18:52 UTC
Point taken. Thanks.

