
Parsing pseudo-HTML with HTML::TokeParser

by mp (Deacon)
on Sep 30, 2002 at 17:35 UTC (#201778=perlquestion)

mp has asked for the wisdom of the Perl Monks concerning the following question:

Is HTML::TokeParser reliable for parsing HTML that contains additional non-HTML tags (<column> </column> in the example input below)?

Example input:

<column>Colum <b>One</b> Header</column>
<column>Column <u>Two</u> Header</column>
<column na="1">Etcetera</column>

The code below seems to work. I just want to make sure that there are no gotchas with regard to using tags that look like HTML but aren't valid HTML (things in angle brackets with optional attributes and an optional slash indicating a closing tag). I prefer HTML::TokeParser over XML::TokeParser because the text between the 'column' tags will, in general, not be well-formed XML.

use HTML::TokeParser;

sub parse_column_list {
    my ($str) = @_;
    my $p = HTML::TokeParser->new(\$str);
    my (@cl, $label, %attr);
    my %attr_default = ( na => 0 );
    while (my $t = $p->get_token) {
        if ($t->[0] eq "S" and $t->[1] eq "column") {
            # <column ...>: start a new label, merge attributes over the defaults
            $label = '';
            %attr = (%attr_default, %{$t->[2]});
        }
        elsif ($t->[0] eq "E" and $t->[1] eq "column") {
            # </column>: record the finished column
            push @cl, { %attr, label => $label };
        }
        elsif ($t->[0] eq "T") {
            # text token: the text is the second element
            $label .= $t->[1];
        }
        else {
            # any other token: the last element is its original source text
            $label .= $t->[-1];
        }
    }
    return \@cl;
}
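
One way to look for gotchas is to dump the tokens and see how the made-up tags come through. The sketch below is only an illustration: `token_summary` is a hypothetical helper, not part of HTML::TokeParser, and the input string is invented. It relies on the documented token layout (`S`, `E`, `T` and so on as the first element) and on the documented behavior that tag and attribute names are returned in lower case.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: summarize each token as "TYPE:detail" for inspection.
sub token_summary {
    my ($str) = @_;
    my $p = HTML::TokeParser->new(\$str);
    my @out;
    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S') {
            # Start tag: tag name plus sorted name=value attribute pairs.
            my $attr = $t->[2];
            push @out, join ' ', "S:$t->[1]",
                map { "$_=$attr->{$_}" } sort keys %$attr;
        }
        elsif ($type eq 'E') {
            push @out, "E:$t->[1]";
        }
        elsif ($type eq 'T') {
            push @out, "T:$t->[1]";
        }
        else {
            # Comments, declarations, etc.: keep the original source text.
            push @out, "$type:$t->[-1]";
        }
    }
    return @out;
}

# The unknown tag is written in upper case on purpose.
print "$_\n" for token_summary('<COLUMN NA="1">x <b>y</b></COLUMN>');
```

If the made-up tag shows up as an ordinary `S:column` / `E:column` pair, matching on the lower-case name `column` (as the code above does) also covers input written as `<COLUMN>` or `<Column>`.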

Replies are listed 'Best First'.
Re: Parsing pseudo-HTML with HTML::TokeParser
by Ovid (Cardinal) on Sep 30, 2002 at 21:20 UTC

    I'm not aware of any problems with parsing non-standard HTML. HTML is so mutable and browser-dependent that, unless you are using a tool that requires a specific DTD, the code you use should be "fault tolerant", so to speak. As a side note, I'd recommend HTML::TokeParser::Simple (full disclosure: I wrote it). It makes your code shorter and easier to read. Here's a small demo.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple;
    use Data::Dumper;

    my $pseudo_html;
    {
        local $/;    # slurp mode: read all of DATA in one go
        $pseudo_html = <DATA>;
    }
    print Dumper parse_column_list( $pseudo_html );

    sub parse_column_list {
        my ($str) = @_;
        my $p = HTML::TokeParser::Simple->new(\$str);
        my (@cl, $label, %attr);
        my %attr_default = ( na => 0 );
        while (my $t = $p->get_token) {
            if ( $t->is_start_tag( 'column' ) ) {
                $label = '';
                %attr = (%attr_default, %{$t->return_attr});
            }
            elsif ( $t->is_end_tag( 'column' ) ) {
                push @cl, { %attr, label => $label };
            }
            else {
                # original source text of any other token
                # (newer releases of the module document this as as_is)
                $label .= $t->return_text;
            }
        }
        return \@cl;
    }

    __DATA__
    <column>Colum <b>One</b> Header</column>
    <column>Column <u>Two</u> Header</column>
    <column na="1">Etcetera</column>



Re: Parsing pseudo-HTML with HTML::TokeParser
by Helter (Chaplain) on Sep 30, 2002 at 17:54 UTC
    Reading about HTML::Parser:
    "As markup and text is recognized, handlers are invoked. The following method is used to set up handlers for different events:"
    So I would assume that, as long as no handlers are assigned to those tags, they would be ignored.
    On the other hand, you might want to define handlers for this code to make your processing life easier.
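
For comparison, here is a minimal sketch of the handler-based approach using HTML::Parser directly. It is untested against the poster's real data, and `parse_columns_with_handlers` is a hypothetical name. Note that HTML::Parser handlers are registered per event type (start, end, text, ...) rather than per tag name, so an unknown tag such as `<column>` is delivered to the start and end handlers like any other tag, and the handler decides what to do with it.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical handler-based counterpart to parse_column_list.
sub parse_columns_with_handlers {
    my ($str) = @_;
    my (@cl, $label, %attr);
    my %attr_default = ( na => 0 );

    my $p = HTML::Parser->new(
        api_version => 3,
        # Every start tag arrives here; only 'column' gets special treatment.
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if ($tag eq 'column') {
                $label = '';
                %attr  = (%attr_default, %$attr);
            }
            else {
                $label .= $text;    # keep nested markup verbatim
            }
        }, 'tagname, attr, text' ],
        # Every end tag arrives here.
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'column') {
                push @cl, { %attr, label => $label };
            }
            else {
                $label .= $text;
            }
        }, 'tagname, text' ],
        # Text, comments and any other event without its own handler.
        default_h => [ sub { $label .= $_[0] if defined $_[0] }, 'text' ],
    );

    $p->parse($str);
    $p->eof;
    return \@cl;
}

my $cols = parse_columns_with_handlers(
    '<column>Colum <b>One</b> Header</column> <column na="1">Etcetera</column>'
);
print "$_->{na}: $_->{label}\n" for @$cols;
```

The token loop in the original question is arguably easier to follow for a job this small; the handler version mainly pays off when the input is fed to the parser in chunks.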

    I'm new to using these tools and have never used this one in particular, so I'm just stating what I read; someone else could probably provide tested code/answers.

    Hope this helps!
Re: Parsing pseudo-HTML with HTML::TokeParser
by mp (Deacon) on Oct 02, 2002 at 15:56 UTC
    Thank you for the replies, and thanks for the pointer to HTML::TokeParser::Simple. It does improve the code's readability.

Node Type: perlquestion [id://201778]
Approved by BazB