http://qs321.pair.com?node_id=42365

spaz has asked for the wisdom of the Perl Monks concerning the following question:

I have several scripts which need to strip out (NOT IDENTIFY) HTML tags.
I currently use s/<[^>]+?>//g to remove all HTML tags on a given line.

Is this the correct way to get it done?
Is this the fastest way to do what I want?

Replies are listed 'Best First'.
Re: HTML Matching
by chromatic (Archbishop) on Nov 19, 2000 at 01:07 UTC
    The FAQ answer (How do I remove HTML from a string?) suggests that HTML::Parse is the most correct answer.

    It also notes that HTML comments, tags that continue over line breaks, and angle brackets within quoted attributes can break a simpler parser. For example:

    <!--- <img src="foo.jpg" alt="proof that 4 > 2"> -->
    If you're dealing with machine-generated HTML and can guarantee a certain degree of cleanliness, your solution will work. Otherwise, you really need a parser. And a parser will be slower, having to keep track of opening brackets and quotes.
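
    To see the failure concretely, here is a minimal sketch that runs the substitution from the question over the comment example above and over an input tag whose attribute value contains a quoted '>'. The helper name naive_strip and the variable names are illustrative only, not from the FAQ.

```perl
use strict;
use warnings;

# The substitution from the original question, wrapped in a helper.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]+?>//g;
    return $html;
}

my $comment = naive_strip('<!--- <img src="foo.jpg" alt="proof that 4 > 2"> -->');
my $quoted  = naive_strip('<input type="text" value=">">');

# Both results still contain leftover markup, because each match
# ends at the first '>' it meets, even one inside quotes.
print "comment leftover: [$comment]\n";
print "quoted leftover:  [$quoted]\n";
```

    In both cases the tail of the markup survives, which is exactly the kind of breakage the FAQ warns about.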

    It's like the old saying, "Only perl can parse Perl."

Re: HTML Matching
by autark (Friar) on Nov 19, 2000 at 01:08 UTC
    A regexp will probably not do it right. Your regexp will fail on this example:

    <input type="text" value=">">

    Why not just use HTML::Parser? That would be the correct way of doing it. And it is fast too, both to write and to execute. Just subclass HTML::Parser and override the text method, like this:

    package MyParser;
    use base 'HTML::Parser';

    sub text {
        my ($self, $origtext, $is_cdata) = @_;
        print $origtext;
    }
    The above code was just copied and pasted from the HTML::Parser pod file.
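
    For comparison, here is a rough sketch of the same idea without subclassing, assuming a version of HTML::Parser that supports the handler-based interface (api_version 3). The helper name strip_html is made up for illustration, and HTML::Parser has to be installed from CPAN since it is not a core module.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not part of core Perl

# Hypothetical helper: collect only the text events and ignore all markup.
sub strip_html {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' asks for the text with entities already decoded
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;          # flush anything still buffered
    return $text;
}

print strip_html('<p>Enter: <input type="text" value=">"> then <b>submit</b></p>'), "\n";
```

    The text handler may be called several times for one run of text, which is why the sketch appends to $text rather than assigning.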

    Autark.

Re: HTML Matching
by japhy (Canon) on Nov 19, 2000 at 01:45 UTC
    Wow. I'll point you to 7 Stages of Regex Users -- specifically the point about using a regex to remove HTML tags.

    They can be matched by a regex (a long one, which I am actually working on) -- it just has to be comprehensive and well thought out.

    japhy -- Perl and Regex Hacker
      They can? Possibly, but I find myself dubious that it is actually possible...
        This matches "regular" HTML tags -- the part that matches the element may need to be changed slightly, but other than that, it matches: <ELEMENT ( ATTR ( = VALUE )? )* >.
        my $open = qr{
          <
          [a-zA-Z][a-zA-Z0-9]*
          (?:
            \s+ \w+
            (?:
              \s* = \s*
              (?: "[^"]*" | '[^']*' | [^\s>]* )
            )?
          )*
          \s* >
        }x;
        The closing tags are far simpler:
        my $close = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;
        Comments are slightly trickier:
        # the following are comments:
        #   <!-- ab -- cd -->   <!-- ab -->   <!---->
        #   <!-- ab -- cd -- >  <!-- ab -- >  <!---- >
        my $comment = qr{
          <!--               # <!--
          [^-]*              # 0 or more non -'s
          (?:
            (?! -- \s* > )   # that's not --, space, then >
            -                # a -
            [^-]*            # 0 or more non -'s
          )*                 # 0 or more times
          -- \s* >           # --, space, then >
        }x;
        The DTD tag is more difficult. There are specific classes of DTD tags (see the specs), so right now I don't have a regex to handle them. But combining the other three regexes:
        while ($HTML =~ /\G($open|$close|$comment|[^<]+)/g) {
          # do something with $1
        }
        Now, using this to create a tree structure of an HTML file shouldn't be too complicated, especially if we use a nice trick like:
        # requires the (?{...}) structure
        use re 'eval';
        while ($HTML =~ m{
          \G
          (
            $open    (?{ $STATE = 'open' })    |
            $close   (?{ $STATE = 'close' })   |
            $comment (?{ $STATE = 'comment' }) |
            [^<]+    (?{ $STATE = 'TEXT' })
          )
        }xg) {
          # do something with $1 and $STATE
        }
        And you can modify $open and $close to keep track of the element name by putting parens in there.
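
        Applied to the original question (stripping rather than identifying), the three patterns can be combined roughly as follows. This is only a sketch: the helper name strip_tags is hypothetical, and keeping a stray '<' as text is an assumption added here so the loop does not stop early on malformed input.

```perl
use strict;
use warnings;

# The three patterns from above, written on one line each.
my $open    = qr{ < [a-zA-Z][a-zA-Z0-9]* (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )* \s* > }x;
my $close   = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;
my $comment = qr{ <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* > }x;

# Hypothetical helper: keep text, drop anything that matched as markup.
sub strip_tags {
    my ($html) = @_;
    my $text = '';
    while ($html =~ /\G(?:$open|$close|$comment|([^<]+)|(<))/g) {
        $text .= $1 if defined $1;   # plain text between tags
        $text .= $2 if defined $2;   # assumption: a stray '<' is kept as text
    }
    return $text;
}

print strip_tags('<input type="text" value=">"> 4 > 2'), "\n";
print strip_tags('<!-- ab -- cd -->text<b>bold</b>'), "\n";
```

        Anything the patterns do not cover, such as DTD tags or attribute names containing characters outside \w, falls through to the stray '<' branch and ends up in the output as text.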

        It's a matter of thoroughness.

        japhy -- Perl and Regex Hacker
Re (tilly) 1: HTML Matching
by tilly (Archbishop) on Nov 19, 2000 at 02:59 UTC
    Why strip them out? Just use the encode_entities function from HTML::Entities on the string.
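
    A minimal sketch of that approach, assuming HTML::Entities is installed from CPAN (it ships with the HTML-Parser distribution). The sample string is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);   # CPAN, not part of core Perl

my $input   = '<b>proof that 4 > 2</b>';

# With no second argument, encode_entities escapes <, >, &, quotes,
# control characters and high-bit characters.
my $escaped = encode_entities($input);

print "$escaped\n";
```

    The markup is then displayed literally by a browser instead of being interpreted, so nothing has to be parsed at all.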
Re: HTML Matching
by cianoz (Friar) on Nov 19, 2000 at 01:16 UTC
    Your code fails with tags that span multiple lines. You should use something like:
    $/ = undef; $_ = <STDIN>; s/<[^>]+?>//mg; print;
      But his regexp contains neither ^ nor $, so what good will /m do?

      His choice of character class [^>] matches anything _except_ the '>' character, that includes newlines.
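
      A quick way to check both points is the following sketch, which uses a made-up two-line sample string. It compares the substitution with and without /m on the whole string, and then applies it one line at a time, the way a script looping over <STDIN> would.

```perl
use strict;
use warnings;

# A tag that spans two lines (sample input, invented for this check).
my $html = "before <a\n   href=\"x.html\">link</a> after\n";

# Whole string at once, with and without /m.
(my $with_m    = $html) =~ s/<[^>]+?>//mg;
(my $without_m = $html) =~ s/<[^>]+?>//g;

# One line at a time: the opening tag is split across two records.
my $per_line = '';
for my $line (split /^/, $html) {
    $line =~ s/<[^>]+?>//g;
    $per_line .= $line;
}

print "same with and without /m\n" if $with_m eq $without_m;
print "per-line result differs\n"  if $per_line ne $with_m;
```

      So what makes the difference is reading the whole input before substituting, not the /m modifier.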

      Autark.