clean html tags

InfiniteLoop has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

Recently I have been tasked with modifying my custom module to output "clean html". The following HTML snippet:

<B>TEXT</B><BR>Foo & BAR

has to be formatted thusly:

<B>TEXT</B><BR>FOO &amp; BAR

Basically, I need to convert the text to HTML entities while retaining the HTML tags. I have looked at a few modules on CPAN, namely HTML::Tidy and HTML::Lint.

I liked HTML::Tidy, except that it requires libtidy, and I might not be able to package libtidy as part of the release.

As for HTML::Lint, the documentation does not describe any method for retrieving the cleaned HTML chunk.

Do you know of any "all Perl", HTML Tidy-like module?


Replies are listed 'Best First'.
Re: clean html tags by ww (Archbishop) on Jan 25, 2007 at 18:43 UTC
BEWARE: THIN ICE! Let us pretend, for this discussion, that I regard the sample in the OP as something approximating "clean html" (ah shucks; just say it: IMO, YMMV, that IS NOT clean; that's flat out ugly!) OK, back to pretending.
Suppose you have a partially "clean html" file to deal with... say something that contains a line not too different from yours: `<B>TEXT & MORE TEXT</B><BR>FOO &nbsp; BAR` where the originator, for whatever reason, knew that one can force a browser to render multiple, consecutive spaces by inserting a charentity space, `&nbsp;`, between each pair of `0x20`s.
Simply converting each ampersand to its charentity will not produce the outcome you want; rather, you'll get something like this: `<B>TEXT &amp; MORE TEXT</B><BR>FOO &amp;nbsp; BAR` which will render as: TEXT & MORE TEXT FOO &nbsp; BAR
Or, suppose the incoming html is badly formed (mis-nested, for example): you're still going to have to rely on the Mark I eyeball or one of the packages discussed elsewhere in this thread to "clean" that, unless the definition of "clean html" is restricted to enforcing use of character entities.
And, finally (by way of illustrating why my opening jape is not mere ill-temper), while the following is open to numerous criticisms (failure to use the "strict" doctype; loading up the keywords meta; style definitions included in-page rather than linked, etc, etc, etc) IT IS valid -- ie, "clean" -- html per w3c's 4.01 standard:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
    <title>Clean html </title>
    <meta name="description" content="clean code for illustration">
    <meta name="keywords" content="html, clean, 'character entities', charentity">
    <meta http-equiv="Content-Style-Type" content="text/css">
    <style type="text/css">
    <!--
    .b { font-weight: bold; }
    -->
    </style>
    </head>
    <body>
    <p> Re: character entities (charentity) and how to clean up html</p>
    <p><span class="b">TEXT &amp; MORE TEXT</span>
    <br>
    FOO &nbsp; BAR
    </p>
    </body>
    </html>
FWIW, and without deprecating the desire to do this with Perl, you might consider the standalone version of Tidy for html or a commercial validator.
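The double-encoding problem described above can be sidestepped, at least for ampersands, by escaping only those that do not already start a character reference. The following is a minimal sketch rather than anything posted in the thread; the function name `escape_bare_ampersands` is hypothetical, and the pattern only checks that something looks like a reference (`&name;`, `&#123;`, `&#x7B;`), not that the entity name is actually defined.

```perl
use strict;
use warnings;

# Hypothetical helper: escape "&" only when it does not already
# begin something shaped like a character reference.
sub escape_bare_ampersands {
    my ($text) = @_;
    return '' unless defined $text;

    # Negative lookahead: leave "&" alone when it is followed by
    # a name, a decimal number or a hex number, and then ";".
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('TEXT & MORE TEXT FOO &nbsp; BAR'), "\n";
# prints: TEXT &amp; MORE TEXT FOO &nbsp; BAR
```

The trade-off is that text which is meant to show a literal `&nbsp;` to the reader cannot be told apart from an intentional entity, so this only suits input that is already partly escaped.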
Re^2: clean html tags by wfsp (Abbot) on Jan 25, 2007 at 20:01 UTC
"...you might consider the standalone version of Tidy for html..."
I agree. It's very nifty indeed, easy to use and highly configurable. Something as simple as:
    my $in  = 'bad.html';
    my $out = 'tidied.html';
    my $err = 'tidy.err';
    my $cnf = 'tidy.cnf';
    # list form of system: no shell is involved, so the file names need no quoting
    system(
        'tidy.exe', '-asxml',
        -config => $cnf,
        -file   => $err,
        -output => $out,
        $in,
    );
can be easily adapted to process a list or even a local copy of a web site. The config file can be tweaked to be severe or lenient to taste. You can easily interrogate all the error files to get a good picture of how bad the html is (and there is a lot of it about!).
Again, I agree. Why bother to go to a lot of trouble when there is a very clever bit of kit available?
Re^2: clean html tags by InfiniteLoop (Hermit) on Jan 25, 2007 at 18:52 UTC
Point taken. Thanks.
Re: clean html tags by madbombX (Hermit) on Jan 25, 2007 at 19:32 UTC
Have you looked at HTML::Entities? That's exactly what it was made to do.
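For readers who have not used it, here is a short sketch of what `encode_entities` from HTML::Entities does with the snippet from the question. It assumes the module is installed (it ships with the HTML-Parser distribution, not with core Perl). Note that the default call escapes the tags as well, so on its own it does not meet the original requirement; passing a second argument restricts which characters are encoded.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

my $snippet = '<B>TEXT</B><BR>FOO & BAR';

# Default set of unsafe characters includes <, > and &,
# so the markup itself is escaped too.
print encode_entities($snippet), "\n";
# prints: &lt;B&gt;TEXT&lt;/B&gt;&lt;BR&gt;FOO &amp; BAR

# Second argument is a character-class body: encode only "&".
print encode_entities($snippet, '&'), "\n";
# prints: <B>TEXT</B><BR>FOO &amp; BAR
```

The second form keeps the tags but will still double-encode any entity that is already present in the input.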
Re: clean html tags by sgifford (Prior) on Jan 25, 2007 at 17:38 UTC
For just escaping HTML entities, I use this code:
    { # closure
        my %HTML_ESCAPE = (
            "\xa0" => "&nbsp;",
            "&"    => "&amp;",
            "'"    => "&apos;",
            "\""   => "&quot;",
            "<"    => "&lt;",
            ">"    => "&gt;",
        );
        sub html_escape {
            return '' unless defined($_[0]);
            (my $t = $_[0]) =~ s/([\xa0\'\"&<>])/$HTML_ESCAPE{$1}/g;
            $t;
        }
    }
It's best to escape the data as it's coming in; otherwise it's very difficult to distinguish between, for example, a less-than sign that should be converted to `&lt;` and one that is part of the markup.
-- sgifford's Web page
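To make the difficulty concrete, here is a rough sketch of escaping after the fact: split the string on anything that looks like a tag and escape only the pieces in between. This is an illustration under stated assumptions, not a recommended cleaner; the name `escape_text_only` is hypothetical, and the tag pattern is deliberately naive (it knows nothing about comments, scripts, or `>` inside attribute values).

```perl
use strict;
use warnings;

my %ESC = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');

# Hypothetical helper: escape text between tag-like chunks only.
sub escape_text_only {
    my ($html) = @_;
    return '' unless defined $html;

    # The capturing group makes split return the separators too:
    # even indexes hold text, odd indexes hold tag-like chunks.
    my @parts = split /(<\/?[A-Za-z][^<>]*>)/, $html;
    for my $i (0 .. $#parts) {
        next if $i % 2;                        # leave markup untouched
        $parts[$i] =~ s/([&<>])/$ESC{$1}/g;    # escape text
    }
    return join '', @parts;
}

print escape_text_only('<B>TEXT</B><BR>FOO & BAR'), "\n";
# prints: <B>TEXT</B><BR>FOO &amp; BAR

print escape_text_only('1 <b and c> 2'), "\n";
# prints: 1 <b and c> 2   (text mistaken for a tag, left unescaped)
```

The second call shows the ambiguity: once text and markup are mixed, a comparison written as `<b and c>` is indistinguishable from a tag, which is the argument for escaping data on the way in.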
Re^2: clean html tags by dorward (Curate) on Jan 26, 2007 at 10:14 UTC
`"'" => "&apos;",`
The apos entity is an XML built-in, and isn't defined for HTML. While some browsers support it in text/html documents, this is error correction and you should not use it.
"It's best to escape the data as it's coming in; otherwise it's very difficult to distinguish between, for example, a less-than sign that should be converted to `&lt;` and one that is part of the markup."
My preference is to convert from text to HTML at the last minute to avoid issues where I need to manipulate the data in Perl. (Template::Stash::EscapeHTML is quite cool.) What matters, though, is doing it in one place, so it's easy to spot when you forget to protect a bit of user input from XSS et al.
Re^3: clean html tags by sgifford (Prior) on Jan 26, 2007 at 19:30 UTC
"The apos entity is an XML built-in, and isn't defined for HTML. While some browsers support it in text/html documents, this is error correction and you should not use it."
Ah, that's interesting. I find it very useful to ensure that user-generated text doesn't break out of an HTML or JavaScript string, which is a big win IMHO. For example, if a template says: `<img src='$IMAGE1' alt='$DESCRIPTION1'>` I can be sure that `$IMAGE1` and `$DESCRIPTION1` won't mess up my HTML formatting if I can ensure they don't contain apostrophes, but otherwise it's impossible. Are you aware of any browsers that don't support this entity in HTML?
-- sgifford's Web page
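One possible way to keep apostrophes from breaking out of a single-quoted attribute without relying on `&apos;` is the numeric character reference `&#39;`, which HTML 4 does define. The sketch below is an assumption-laden illustration, not code from either poster; `attr_escape` and the sample values are hypothetical, and it covers HTML attribute values only, not JavaScript string contexts.

```perl
use strict;
use warnings;

# Same idea as an escape table, but the apostrophe maps to a
# numeric character reference instead of the XML-only &apos;.
my %ATTR_ESCAPE = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper for values placed inside quoted attributes.
sub attr_escape {
    my ($value) = @_;
    return '' unless defined $value;
    $value =~ s/([&<>"'])/$ATTR_ESCAPE{$1}/g;
    return $value;
}

my $image       = 'bob.png';        # made-up sample data
my $description = "Bob's photo";    # made-up sample data

printf "<img src='%s' alt='%s'>\n",
    attr_escape($image), attr_escape($description);
# prints: <img src='bob.png' alt='Bob&#39;s photo'>
```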
Re^4: clean html tags by dorward (Curate) on Jan 27, 2007 at 01:01 UTC
Re^5: clean html tags by sgifford (Prior) on Mar 20, 2007 at 21:17 UTC
Re: clean html tags by Anonymous Monk on Jan 25, 2007 at 19:24 UTC
The previous poster is right in that this is not clean HTML. I also do not know whether you merely want to escape HTML entities; however, you might want to try HTML::TreeBuilder to build an HTML parse tree and HTML::Element, specifically the `as_HTML()` method, to output the HTML. You can use `as_HTML` to escape HTML entities.
Johannes
