http://qs321.pair.com?node_id=201778

mp has asked for the wisdom of the Perl Monks concerning the following question:

Is HTML::TokeParser reliable for parsing HTML that contains additional non-HTML tags (<column> and </column> in the example input below)?

Example input:

<column>Colum <b>One</b> Header</column>
<column>Column <u>Two</u> Header</column>
<column na="1">Etcetera</column>

The code below seems to work. I just want to make sure there are no gotchas in using tags that look like HTML but are not valid HTML (names in angle brackets, with optional attributes and an optional slash marking a closing tag). I prefer HTML::TokeParser over XML::TokeParser because the text between the 'column' tags will in general not be well-formed XML.

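use HTML::TokeParser;

sub parse_column_list {
    my ($str) = @_;
    my $p = HTML::TokeParser->new(\$str);
    my (@cl, $label, %attr);
    my %attr_default = ( na => 0 );

    while (my $t = $p->get_token) {
        if ($t->[0] eq "S" and $t->[1] eq "column") {
            # Opening <column>: reset the label and merge attributes over the defaults.
            $label = '';
            %attr  = (%attr_default, %{ $t->[2] });
        }
        elsif ($t->[0] eq "E" and $t->[1] eq "column") {
            # Closing </column>: store the finished column.
            push @cl, { %attr, label => $label };
        }
        elsif ($t->[0] eq "T") {
            # Plain text: the text is the second element of the token.
            $label .= $t->[1];
        }
        else {
            # Any other token (e.g. <b>, </u>): its original source text is the last element.
            $label .= $t->[-1];
        }
    }
    return \@cl;
}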
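
One way to look for surprises is to dump the token stream and see how the parser reports the made-up tags. The following is only a minimal sketch for inspection, not part of the code above: the helper name token_summary is hypothetical, and the input is a shortened version of the example. It prints one "TYPE:detail" line per token, where the detail is the tag name for start and end tokens and the text for text tokens.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened version of the example input from the question.
my $input = '<column>Colum <b>One</b> Header</column> <column na="1">Etcetera</column>';

# token_summary is a hypothetical helper: it returns an array ref with one
# "TYPE:detail" string per token so the stream is easy to read.
sub token_summary {
    my ($str) = @_;
    my $p = HTML::TokeParser->new(\$str) or die "cannot create parser";
    my @out;
    while (my $t = $p->get_token) {
        # Element 0 is the token type; element 1 is the tag name for
        # S/E tokens and the text itself for T tokens.
        my ($type, $detail) = @$t;
        push @out, "$type:$detail";
    }
    return \@out;
}

print "$_\n" for @{ token_summary($input) };
```

If <column> shows up as ordinary S and E tokens alongside <b>, that at least suggests the tokenizer is not treating the unknown tag name specially for this input; it does not prove the absence of gotchas for other inputs.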