http://qs321.pair.com?node_id=596572


in reply to clean html tags

For just escaping HTML entities, I use this code:
{ # closure
    my %HTML_ESCAPE = (
        "\xa0" => "&nbsp;",
        "&"    => "&amp;",
        "'"    => "&apos;",
        "\""   => "&quot;",
        "<"    => "&lt;",
        ">"    => "&gt;",
    );

    sub html_escape {
        return '' unless defined($_[0]);
        (my $t=$_[0]) =~ s/([\xa0\'\"&<>])/$HTML_ESCAPE{$1}/g;
        $t;
    }
}
It's best to escape the data as it's coming in; otherwise it's very difficult to distinguish between, for example, a less-than sign that should be converted to &lt; and one that is part of the markup.

Re^2: clean html tags
by dorward (Curate) on Jan 26, 2007 at 10:14 UTC
    "'" => "&apos;",

    The apos entity is an XML built-in, and isn't defined for HTML. While some browsers support it in text/html documents, this is error correction and you should not use it.

    It's best to escape the data as it's coming in; otherwise it's very difficult to distinguish between, for example, a less-than sign that should be converted to &lt; and one that is part of the markup.

    My preference is to convert from text to HTML at the last minute to avoid issues where I need to manipulate the data in Perl. (Template::Stash::EscapeHTML is quite cool).

    What matters, though, is doing it in one place, so it's easy to spot when you forget to protect a bit of user input from XSS et al.

      The apos entity is an XML built-in, and isn't defined for HTML. While some browsers support it in text/html documents, this is error correction and you should not use it.
      Ah, that's interesting. I find it very useful to ensure that user-generated text doesn't break out of an HTML or JavaScript string, which is a big win IMHO. For example, if a template says:
      <img src='$IMAGE1' alt='$DESCRIPTION1'>
      I can be sure that $IMAGE1 and $DESCRIPTION1 won't mess up my HTML formatting as long as any apostrophes in them are escaped; otherwise it's impossible.
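
To make the break-out concrete, here is a minimal sketch. The img_tag helper and the input string are made up for illustration, and only the apostrophe is escaped to keep it short (a real escaper would handle &, <, > and " as well):

```perl
use strict;
use warnings;

# Hypothetical helper: fills in the single-quoted template from the post.
sub img_tag {
    my ($src, $alt) = @_;
    return "<img src='$src' alt='$alt'>";
}

# Made-up hostile input containing an apostrophe.
my $description = "cat' onerror='alert(1)";

# Raw interpolation: the apostrophe closes alt='...' early, so the rest
# of the string is parsed as a separate onerror attribute.
my $unsafe = img_tag('cat.png', $description);

# Escaping the apostrophe first keeps the whole string inside alt='...'.
(my $escaped = $description) =~ s/'/&apos;/g;
my $safe = img_tag('cat.png', $escaped);

print "$unsafe\n$safe\n";
```

The first line printed carries an injected onerror attribute; in the second, the apostrophes arrive as entities and the value stays inside the alt attribute.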

      Are you aware of any browsers that don't support this entity in HTML?

        Ah, that's interesting. I find it very useful to ensure that user-generated text doesn't break out of an HTML or JavaScript string

        You get the same effect if you use the numeric character reference as described in the document I previously linked to, or avoid delimiting attribute values with single quotes and use the more conventional double quotes.

        Are you aware of any browsers that don't support this entity in HTML?

        Not off the top of my head, but using it in text/html is non-standard, and it's easy to avoid.
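
For what it's worth, both suggestions can be combined in a small variation on the routine from the top of the thread. This is only a sketch: it assumes &#39; as the numeric character reference for the apostrophe, and the input string and file name are made up:

```perl
use strict;
use warnings;

{   # closure: the lookup table is visible only to html_escape
    my %HTML_ESCAPE = (
        "\xa0" => "&nbsp;",
        "&"    => "&amp;",
        "'"    => "&#39;",    # numeric reference instead of &apos;
        "\""   => "&quot;",
        "<"    => "&lt;",
        ">"    => "&gt;",
    );

    sub html_escape {
        return '' unless defined $_[0];
        (my $t = $_[0]) =~ s/([\xa0'"&<>])/$HTML_ESCAPE{$1}/g;
        return $t;
    }
}

# Made-up user input with an apostrophe, double quotes and angle brackets.
my $alt = html_escape(q{Tom's "best" <pic>});

# Double-quoted attribute values, as suggested above.
print qq{<img src="cat.png" alt="$alt">\n};
```

This prints <img src="cat.png" alt="Tom&#39;s &quot;best&quot; &lt;pic&gt;">, so the value is safe whether the template delimits attributes with single or double quotes.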