BEWARE: THIN ICE!

Let us pretend, for this discussion, that I regard the sample in the OP as something approximating "clean html." (Ah, shucks; just say it: IMO, YMMV, that IS NOT clean; that's flat-out ugly!) OK, back to pretending.

Suppose you have a partially "clean html" file to deal with... say something that contains a line not too different from yours...

<B>TEXT & MORE TEXT</B><BR>FOO &nbsp; BAR
where the originator, for whatever reason, knew that one can force a browser to render multiple consecutive spaces by inserting the non-breaking-space character entity, &nbsp;, between each pair of ordinary spaces (0x20s).

Simply converting each ampersand to its character entity (&amp;) will not produce the outcome you want, because the ampersand that begins &nbsp; gets converted too; you'll get something like this:

<B>TEXT &amp; MORE TEXT</B><BR>FOO &amp;nbsp; BAR
which will render as:
TEXT & MORE TEXT
FOO &nbsp; BAR
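
For what it's worth, here is a minimal sketch of the difference in Perl. The sub name escape_bare_ampersands is made up for illustration, and the regex is only an assumption about what counts as "already an entity": anything shaped like &name;, &#123; or &#xAB; is left alone, with no check that the name is a real HTML entity. Treat it as an illustration of the pitfall, not a cleaner.

```perl
use strict;
use warnings;

# Hypothetical helper: escape only "bare" ampersands, i.e. those that do
# not already begin something shaped like a named (&nbsp;), decimal
# (&#160;) or hex (&#xA0;) character reference.
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $html;
}

my $line = '<B>TEXT & MORE TEXT</B><BR>FOO &nbsp; BAR';

# Naive approach: every ampersand is converted, so &nbsp; is mangled.
(my $naive = $line) =~ s/&/&amp;/g;

# Lookahead approach: the existing &nbsp; survives untouched.
my $careful = escape_bare_ampersands($line);

print "$naive\n";
print "$careful\n";
```

The negative lookahead is what keeps the existing &nbsp; intact. The obvious weakness is text that merely looks like an entity (a stray "&T;" for instance), which this sketch will happily leave unescaped.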

Or, suppose the incoming html is badly formed (mis-nested, for example): you're still going to have to rely on the Mark I eyeball or one of the packages discussed elsewhere in this thread to "clean" that, unless the definition of "clean html" is restricted to enforcing use of character entities.
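
To make "mis-nested" concrete, here is a rough sketch of a stack-based check in Perl. Everything in it (the sub name nesting_problems, the list of void elements, the tag regex) is an assumption made for illustration. It knows nothing about comments, CDATA, attribute values containing ">", or the end tags that HTML 4.01 lets you omit (</p>, </li> and friends), so it will complain about some perfectly valid documents. That is exactly why the real packages exist.

```perl
use strict;
use warnings;

# Elements that never take a closing tag in HTML 4.01 (assumed list),
# so they are never pushed on the stack.
my %void = map { $_ => 1 } qw(area base br col hr img input link meta param);

# Hypothetical helper: return a list of nesting complaints for $html
# (an empty list means nothing was found by this crude check).
sub nesting_problems {
    my ($html) = @_;
    my (@stack, @problems);
    while ($html =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>}g) {
        my ($closing, $tag) = ($1, lc $2);
        next if $void{$tag};
        if (!$closing) {
            push @stack, $tag;              # opening tag: remember it
        }
        elsif (@stack && $stack[-1] eq $tag) {
            pop @stack;                     # closes the innermost open tag
        }
        else {
            my $open = @stack ? $stack[-1] : 'nothing';
            push @problems, "</$tag> closes while <$open> is open";
        }
    }
    push @problems, "<$_> never closed" for @stack;
    return @problems;
}

print "$_\n" for nesting_problems('<b><i>oops</b></i>');
```

Escaping ampersands does nothing for input like that last line; the nesting has to be repaired by something that actually understands the markup.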

And, finally (by way of illustrating why my opening jape is not mere ill-temper), while the following is open to numerous criticisms (failure to use the "strict" doctype; loading up the keywords meta; style definitions included in-page rather than linked; etc., etc., etc.), IT IS valid -- i.e., "clean" -- html per W3C's 4.01 standard:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>Clean html </title>
<meta name="description" content="clean code for illustration">
<meta name="keywords" content="html, clean, 'character entities', charentity">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css">
<!--
.b { font-weight: bold; }
-->
</style>
</head>
<body>
<p> Re: character entities (charentity) and how to clean up html</p>
<p><span class="b">TEXT &amp; MORE TEXT</span>
<br>
FOO &nbsp; BAR
</p>
</body>
</html>

FWIW, and without deprecating the desire to do this with Perl, you might consider the standalone version of Tidy for html or a commercial validator.

In reply to Re: clean html tags by ww
in thread clean html tags by InfiniteLoop
