This matches "regular" HTML opening tags. The part that matches
the element name may need to be changed slightly, but other than
that, it matches: <ELEMENT ( ATTR ( = VALUE )? )* >.
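my $open = qr{
  < [a-zA-Z][a-zA-Z0-9]*                   # <, then the element name
  (?:                                      # an attribute:
    \s+ \w+                                #   space, then its name
    (?: \s* = \s*                          #   optionally = and a value
      (?: "[^"]*" | '[^']*' | [^\s>]* )    #   "..." or '...' or bare
    )?
  )*                                       # 0 or more attributes
  \s*
  >
}x;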
The closing tags are far simpler:
my $close = qr{
< / \s* [a-zA-Z][a-zA-Z0-9]* \s* >
}x;
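Here is a quick sanity check of the two patterns, as a sketch only. The sample strings and the helper name classify are made up for illustration, and each pattern is anchored with \A and \z so that only whole-string matches count.

```perl
use strict;
use warnings;

my $open = qr{
  < [a-zA-Z][a-zA-Z0-9]*
  (?:
    \s+ \w+
    (?: \s* = \s*
      (?: "[^"]*" | '[^']*' | [^\s>]* )
    )?
  )*
  \s*
  >
}x;

my $close = qr{
  < / \s* [a-zA-Z][a-zA-Z0-9]* \s* >
}x;

# classify() is a hypothetical helper, not part of the original post.
# It reports which pattern (if any) matches the whole string.
sub classify {
  my ($s) = @_;
  return 'open'  if $s =~ /\A$open\z/;
  return 'close' if $s =~ /\A$close\z/;
  return 'no match';
}

# Illustrative inputs only.
my @samples = (
  '<p>',
  '<a href="x.html" target=_top>',
  '<input disabled>',
  '</p>',
  '</ div >',
  '<1abc>',
  '<br />',
);

printf "%-32s %s\n", $_, classify($_) for @samples;
```

Note that '<br />' falls through to 'no match': as written, $open has no provision for a trailing slash, which is one of the places where the pattern may need adjusting.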
Comments are slightly trickier:
# the following are comments:
# <!-- ab -- cd --> <!-- ab --> <!---->
# <!-- ab -- cd -- > <!-- ab -- > <!---- >
my $comment = qr{
  <!--                  # <!--
  [^-]*                 # 0 or more non -'s
  (?:
    (?! -- \s* > )      # as long as we're not at --, optional space, >
    -                   # a -
    [^-]*               # 0 or more non -'s
  )*                    # 0 or more times
  -- \s* >              # --, optional space, then >
}x;
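To see the comment pattern at work, here is a small sketch that runs it against the examples listed above plus a couple of non-comments. The helper name is_comment and the extra sample strings are my own, for illustration only.

```perl
use strict;
use warnings;

my $comment = qr{
  <!--
  [^-]*
  (?:
    (?! -- \s* > )
    -
    [^-]*
  )*
  -- \s* >
}x;

# is_comment() is a hypothetical helper: true only if the
# whole string is a single comment.
sub is_comment {
  my ($s) = @_;
  return $s =~ /\A$comment\z/ ? 1 : 0;
}

my @samples = (
  '<!-- ab -- cd -->',
  '<!-- ab -->',
  '<!---->',
  '<!-- ab -- >',
  '<!-- unterminated',
  '<!- not a comment ->',
);

for my $s (@samples) {
  printf "%-24s %s\n", $s, is_comment($s) ? 'comment' : 'no match';
}

# Unanchored, the pattern stops at the first terminator it finds,
# so two comments in a row are not swallowed as one.
my ($first) = '<!-- a --> text <!-- b -->' =~ /($comment)/;
print "first comment: $first\n";
```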
The DTD tag is more difficult. There are specific classes
of DTD tags (see the specs), so right now I don't have a
regex to handle them. But combining the other three regexes:
while ($HTML =~ /\G($open|$close|$comment|[^<]+)/g) {
# do something with $1
}
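A runnable version of that loop might look like the sketch below. The sub name tokenize, the sample input, and the /c modifier are my additions: /c keeps pos() intact after the final failed match, which makes it possible to tell whether the loop consumed the whole string or stopped at a "<" that none of the patterns could handle.

```perl
use strict;
use warnings;

my $open = qr{
  < [a-zA-Z][a-zA-Z0-9]*
  (?:
    \s+ \w+
    (?: \s* = \s*
      (?: "[^"]*" | '[^']*' | [^\s>]* )
    )?
  )*
  \s*
  >
}x;

my $close = qr{
  < / \s* [a-zA-Z][a-zA-Z0-9]* \s* >
}x;

my $comment = qr{
  <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* >
}x;

# tokenize() is a hypothetical wrapper around the loop.
# Returns a reference to the token list and the offset where matching stopped.
sub tokenize {
  my ($html) = @_;
  my @tokens;
  while ($html =~ /\G($open|$close|$comment|[^<]+)/gc) {
    push @tokens, $1;
  }
  # pos() is undef if nothing matched at all, hence the default of 0.
  my $stopped_at = pos($html) // 0;
  return (\@tokens, $stopped_at);
}

my $html = '<p class="x">Hi <!-- note --> there</p>';
my ($tokens, $end) = tokenize($html);
print "[$_]\n" for @$tokens;
print $end == length($html)
  ? "consumed everything\n"
  : "stopped at offset $end\n";
```

If the offset is short of the string's length, the input contains something the three patterns don't cover (a DTD tag, or a stray "<" in text), and the caller can decide whether to skip it or give up.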
Now, using this to create a tree structure of an HTML file
shouldn't be too complicated, especially if we use a nice
trick like:
# the (?{...}) structure requires this
use re 'eval';
while ($HTML =~ m{
  \G
  (
    $open    (?{ $STATE = 'open' })    |
    $close   (?{ $STATE = 'close' })   |
    $comment (?{ $STATE = 'comment' }) |
    [^<]+    (?{ $STATE = 'text' })
  )
}xg) {
  # do something with $1 and $STATE
}
And you can modify $open and $close to keep track of the
element name by putting capturing parens around the name part.
How far you take it is a matter of thoroughness.
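As a rough sketch of where this could lead, the version below captures the element names and uses them to build a simple tree. Several things here are my own assumptions rather than part of the approach above: it uses named captures (which need Perl 5.10 or later) instead of $STATE, the names build_tree and outline are hypothetical, and it assumes well-nested input where every element is explicitly closed. Comments are dropped, and a closing tag that doesn't match the innermost open element is ignored.

```perl
use strict;
use warnings;

my $name = qr{[a-zA-Z][a-zA-Z0-9]*};

my $open = qr{
  < (?<open_name> $name )
  (?:
    \s+ \w+
    (?: \s* = \s*
      (?: "[^"]*" | '[^']*' | [^\s>]* )
    )?
  )*
  \s*
  >
}x;

my $close = qr{
  < / \s* (?<close_name> $name ) \s* >
}x;

my $comment = qr{
  <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* >
}x;

# build_tree() is a hypothetical helper. Element nodes are hashes with a
# name and a list of children; text is stored as plain strings.
sub build_tree {
  my ($html) = @_;
  my $root  = { name => '#root', children => [] };
  my @stack = ($root);

  while ($html =~ /\G(?:$open|$close|$comment|(?<text>[^<]+))/g) {
    if (defined $+{open_name}) {
      my $node = { name => lc $+{open_name}, children => [] };
      push @{ $stack[-1]{children} }, $node;
      push @stack, $node;
    }
    elsif (defined $+{close_name}) {
      # Only pop when the closing tag matches the innermost open element.
      pop @stack if @stack > 1 && $stack[-1]{name} eq lc $+{close_name};
    }
    elsif (defined $+{text}) {
      push @{ $stack[-1]{children} }, $+{text};
    }
    # Anything else was a comment, which this sketch drops.
  }
  return $root;
}

# outline() renders the tree as an indented listing.
sub outline {
  my ($node, $depth) = @_;
  $depth //= 0;
  my $out = ('  ' x $depth) . $node->{name} . "\n";
  for my $child (@{ $node->{children} }) {
    $out .= ref($child)
      ? outline($child, $depth + 1)
      : ('  ' x ($depth + 1)) . qq{"$child"\n};
  }
  return $out;
}

my $tree = build_tree('<div><p>Hi</p><!-- c --><b>x</b></div>');
print outline($tree);
```

Elements that are never closed in real HTML (such as br or img) would make everything after them nest one level too deep here, so a more thorough version would need a list of such elements.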
japhy --
Perl and Regex Hacker