HTML Matching

spaz has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML Matching by chromatic (Archbishop) on Nov 19, 2000 at 01:07 UTC
The FAQ answer (How do I remove HTML from a string?) suggests that HTML::Parse is the most correct answer. It also notes that HTML comments, tags that continue over line breaks, and angle brackets within quoted attributes can break a simpler parser. For example: `<!--- <img src="foo.jpg" alt="proof that 4 > 2"> -->` If you're dealing with machine-generated HTML and can guarantee a certain degree of cleanliness, your solution will work. Otherwise, you really need a parser. And a parser will be slower, having to keep track of opening brackets and quotes. It's like the old saying, "Only perl can parse Perl."
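To make the failure concrete, here is a minimal sketch that runs a naive tag-stripping substitution over the comment example above. The pattern `s/<[^>]*>//g` is an assumption standing in for the original poster's regex, which is not shown in this thread.

```perl
use strict;
use warnings;

# The commented-out image tag from the example above.
my $html = '<!--- <img src="foo.jpg" alt="proof that 4 > 2"> -->';

# Assumed stand-in for a "simple" tag stripper: everything from '<'
# up to the next '>' is treated as one tag and deleted.
(my $stripped = $html) =~ s/<[^>]*>//g;

# The match ends at the '>' inside the quoted alt attribute,
# so part of the comment survives.
print "[$stripped]\n";    # prints [ 2"> -->]
```

The substitution stops at the first `>`, which is the one inside the quoted `alt` attribute, so the tail of the comment leaks into the output.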
Re: HTML Matching by autark (Friar) on Nov 19, 2000 at 01:08 UTC
A regexp will probably not do it right. Your regexp will fail on this example: `<input type="text" value=">">` Why not just use HTML::Parser? That would be the correct way of doing it. And it is fast too, both to write and to execute. Just subclass HTML::Parser and use the text method, like this: `package MyParser; use base 'HTML::Parser'; sub text { my($self, $origtext, $is_cdata) = @_; print $origtext; }` The above code was just copied and pasted from the HTML::Parser pod file. Autark.
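For completeness, a small sketch of how such a subclass might be driven, assuming HTML::Parser is installed from CPAN. The `collected` hash key and the sample input are hypothetical choices for illustration; the text is accumulated on the object instead of printed so that it can be inspected afterwards.

```perl
package MyParser;
use strict;
use warnings;
use base 'HTML::Parser';

# Called by the parser for each chunk of plain text between tags.
sub text {
    my ($self, $origtext, $is_cdata) = @_;
    $self->{collected} .= $origtext;    # 'collected' is a made-up key
}

package main;

my $p = MyParser->new;

# The input that trips up a simple regex: '>' inside a quoted attribute.
$p->parse('<p>Value:<input type="text" value=">">done</p>');
$p->eof;    # flush anything the parser is still buffering

print $p->{collected}, "\n";
```

Calling `eof` matters here: the parser may hold back trailing text until it knows no more input is coming.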
Re: HTML Matching by japhy (Canon) on Nov 19, 2000 at 01:45 UTC
Wow, I point to 7 Stages of Regex Users -- the point about using a regex to remove HTML tags. They can be matched by a regex (a long one, which I am actually working on) -- it just has to be comprehensive and well thought out. `japhy` -- Perl and Regex Hacker
Re (tilly) 2: HTML Matching by tilly (Archbishop) on Nov 19, 2000 at 03:02 UTC
They can? Possibly, but I find myself dubious that it is actually possible...
Re: Re (tilly) 2: HTML Matching by japhy (Canon) on Nov 19, 2000 at 03:26 UTC
This matches "regular" HTML tags -- the part that matches the element may need to be changed slightly, but other than that, it matches: `<ELEMENT ( ATTR ( = VALUE )? )* >`.

    my $open = qr{
      <
      [a-zA-Z][a-zA-Z0-9]*
      (?:
        \s+ \w+
        (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )?
      )*
      \s* >
    }x;

The closing tags are far simpler:

    my $close = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;

Comments are slightly trickier:

    # the following are comments:
    # <!-- ab -- cd -->  <!-- ab -->  <!---->
    # <!-- ab -- cd -- > <!-- ab -- > <!---- >
    my $comment = qr{
      <!--               # <!--
      [^-]*              # 0 or more non -'s
      (?:
        (?! -- \s* > )   # that's not --, space, then >
        -                # a -
        [^-]*            # 0 or more non -'s
      )*                 # 0 or more times
      -- \s* >           # --, space, then >
    }x;

The DTD tag is more difficult. There are specific classes of DTD tags (see the specs). So right now I don't have a regex to handle them. But combining the other three regexes:

    while ($HTML =~ /\G($open|$close|$comment|[^<]+)/g) {
      # do something with $1
    }

Now, using this to create a tree structure of an HTML file shouldn't be too complicated, especially if we use a nice trick like:

    # requires the (?{...}) structure
    use re 'eval';
    while ($HTML =~ m{
      \G
      (
        $open    (?{ $STATE = 'open' }) |
        $close   (?{ $STATE = 'close' }) |
        $comment (?{ $STATE = 'comment' }) |
        [^<]+    (?{ $STATE = 'TEXT' })
      )
    }xg) {
      # do something with $1 and $STATE
    }

And you can modify `$open` and `$close` to keep track of the element name by putting parens in there. It's a matter of thoroughness. `japhy` -- Perl and Regex Hacker
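A runnable sketch of the tokenizer idea above, under a few assumptions: the quoted-value branches are written as `"[^"]*"` and `'[^']*'`, the token type is tracked with an `if`/`elsif` chain over `\G` and `/gc` rather than with `(?{...})`, and the `tokenize` helper and the sample input are hypothetical names chosen for illustration. Like the post, it does not handle DTD declarations.

```perl
use strict;
use warnings;

my $open = qr{
    < [a-zA-Z][a-zA-Z0-9]*
    (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )*
    \s* >
}x;
my $close   = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;
my $comment = qr{ <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* > }x;

# Split a string into [type, text] pairs, walking left to right.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    pos($html) = 0;
    while (pos($html) < length $html) {
        # /c keeps pos() where it is when a branch fails to match.
        if    ($html =~ /\G($comment)/gc) { push @tokens, [comment => $1] }
        elsif ($html =~ /\G($close)/gc)   { push @tokens, [close   => $1] }
        elsif ($html =~ /\G($open)/gc)    { push @tokens, [open    => $1] }
        elsif ($html =~ /\G([^<]+)/gc)    { push @tokens, [text    => $1] }
        else  { last }    # a '<' that none of the patterns accept
    }
    return @tokens;
}

my @tokens = tokenize('<p class="x">4 &gt; 2<!-- c --></p>');
printf "%-8s %s\n", @$_ for @tokens;
```

The explicit chain trades the compactness of a single alternation for not needing `use re 'eval'`; which reads better is a matter of taste.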
Re (tilly) 4: HTML Matching by tilly (Archbishop) on Nov 19, 2000 at 03:48 UTC
Re (tilly) 1: HTML Matching by tilly (Archbishop) on Nov 19, 2000 at 02:59 UTC
Why strip them out? Just use the encode_entities function from HTML::Entities on the string.
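A minimal sketch of that suggestion, assuming HTML::Entities is installed from CPAN. The sample string is made up for illustration. Instead of removing markup, the characters that make it markup are escaped, so a browser shows the tags as literal text.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical user-supplied string containing markup.
my $input = '<b>proof that 4 > 2</b> & more';

# With no second argument, encode_entities escapes <, >, & and quotes
# (plus control and high-bit characters).
my $safe = encode_entities($input);

print "$safe\n";    # prints &lt;b&gt;proof that 4 &gt; 2&lt;/b&gt; &amp; more
```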
Re: HTML Matching by cianoz (Friar) on Nov 19, 2000 at 01:16 UTC
Your code fails with tags that span multiple lines. You should use something like: `$/ = undef; $_ = <STDIN>; s/<[^>]+?>//mg; print;`
Re: Re: HTML Matching by autark (Friar) on Nov 19, 2000 at 01:24 UTC
But his regexp contains neither `^` nor `$`, so what good will `/m` do? His choice of character class, `[^>]`, matches anything _except_ the '>' character, and that includes newlines. Autark.
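A short sketch of both points, using a made-up two-line tag: the negated class crosses the newline with no modifier at all, and what actually breaks multi-line tags is applying the substitution one line at a time.

```perl
use strict;
use warnings;

# Hypothetical input: an opening tag split across two lines.
my $html = "<a\n  href=\"x.html\">link</a>\n";

# Whole string at once: [^>] matches the newline, no /m or /s needed.
(my $whole = $html) =~ s/<[^>]+>//g;

# Line by line: each line is processed in isolation, so the first line
# has no '>' to close the tag and the split tag survives.
my $by_line = join '', map { (my $l = $_) =~ s/<[^>]+>//g; $l }
              split /^/, $html;

print "whole:   $whole";
print "by line: $by_line";
```

This is why slurping the input (for example with `$/ = undef`) is the part of the suggestion that matters, not the `/m` flag.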

