http://qs321.pair.com?node_id=717646

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to parse HTML that contains pseudo HTML tags using HTML::TokeParser but am running into a brick wall.

The HTML looks like this...

<CDATA>
Just some random whatever. It might have some <b>real</b> HTML like a table or CSS styling

or even some <H1>IMPORTANT</H1> words. Maybe even a form <form method=post>...</form>
</CDATA>

My code below will parse the HTML fine, but I lose the HTML markup between the <CDATA>...</CDATA> tags.

How can I retain the HTML markup between these <CDATA> tags?

my $p = HTML::TokeParser->new( \$sample_HTML );
while (my $token = $p->get_tag('cdata')){
    my $text = $p->get_trimmed_text("/cdata");
    print "Found Data: $text\n";
}

This code returns the text with all of the HTML stripped out, which is not what I want.

Found Data: Just some random whatever. It might have some real HTML like a table or CSS styling or even some IMPORTANT words. Maybe even a form ...

Re: HTML::TokeParser Frustration
by graff (Chancellor) on Oct 17, 2008 at 05:18 UTC
    I'm not sure what your ultimate goal is, but if you want to print out just the content of a "cdata" tag, with all other tags retained within that section of data, maybe something like this will do:
    #!/usr/bin/perl
    use strict;
    use HTML::TokeParser;

    my $sample_HTML = <<EOD;
    <HTML>
    blah.
    <CDATA>
    Just some random whatever. It might have some <b>real</b> HTML like a table or CSS styling
    or even some <H1>IMPORTANT</H1> words. Maybe even a form <form method=post>...</form>
    </CDATA>
    </HTML>
    EOD

    my $p = HTML::TokeParser->new( \$sample_HTML );
    my $in_cdata = 0;
    while ( my $token = $p->get_token ) {
        my ( $tkn_type, $tkn_content, @rest ) = @$token;
        if ( $tkn_type =~ /[SE]/ ) {
            $tkn_content = pop @rest;  # last array element is full tag string
        }
        # /i is needed here: the full tag string keeps its original case
        # ("</CDATA>"), so a case-sensitive match would print the end tag.
        print $tkn_content if ( $in_cdata and $tkn_content !~ /cdata/i );
        if ( $tkn_content =~ /cdata/i ) {
            $in_cdata += ( $tkn_type eq 'S' ) ? 1 : -1;
        }
    }
    That doesn't print the CDATA tags themselves, but it prints everything inside the CDATA tags, including other tags. To do that, the main loop has to process all "tokens" (all tags and all intervening text in the whole document) one token at a time, and a state variable has to keep track of when you're inside a cdata section as opposed to not being inside one.
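For what it's worth, the same state-variable idea can be sketched by comparing the parsed tag name instead of pattern-matching the token text, so that ordinary text that happens to contain the word "cdata" is not mistaken for a tag. This is only a sketch: the sub name extract_cdata and the shortened sample string are made up for illustration and are not from the posts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the thread): returns the raw markup found
# between <cdata> and </cdata>, leaving out the wrapper tags themselves.
sub extract_cdata {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse input\n";

    my $depth = 0;     # state variable: greater than 0 while inside a cdata section
    my $out   = '';

    while ( my $token = $p->get_token ) {
        my $type = $token->[0];

        # For S (start) and E (end) tokens, element 1 is the lower-cased tag name.
        if ( ( $type eq 'S' or $type eq 'E' ) and $token->[1] eq 'cdata' ) {
            $depth += ( $type eq 'S' ) ? 1 : -1;
            next;
        }
        next unless $depth > 0;

        # Original source text: element 1 for text tokens,
        # the last element for every other token type.
        $out .= ( $type eq 'T' ) ? $token->[1] : $token->[-1];
    }
    return $out;
}

print extract_cdata('<html>blah. <CDATA>some <b>real</b> HTML</CDATA></html>'), "\n";
```

Collecting the pieces into a string rather than printing them as you go makes it easy to reuse the result later, at the cost of holding the whole section in memory.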
Re: HTML::TokeParser Frustration
by wfsp (Abbot) on Oct 17, 2008 at 07:12 UTC
    graff has answered your question. fwiw I find HTML::TokeParser::Simple can, imo, help make this sort of task easier.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $sample_HTML = <<EOD;
    <HTML>
    blah.
    <CDATA>
    Just some random whatever. It might have some <b>real</b> HTML like a table or CSS styling
    or even some <H1>IMPORTANT</H1> words. Maybe even a form <form method=post>...</form>
    </CDATA>
    </HTML>
    EOD

    my $p = HTML::TokeParser::Simple->new(\$sample_HTML)
      or die qq{parse failed\n};

    my ($in_cdata, $cdata);
    while (my $t = $p->get_token){
        $in_cdata++, next if $t->is_start_tag(q{cdata});
        $in_cdata--, next if $t->is_end_tag(q{cdata});
        next unless $in_cdata;
        $cdata .= $t->as_is;
    }
    print $cdata;

    Output:

    Just some random whatever. It might have some <b>real</b> HTML like a table or CSS styling
    or even some <H1>IMPORTANT</H1> words. Maybe even a form <form method=post>...</form>
Re: HTML::TokeParser Frustration
by Anonymous Monk on Oct 17, 2008 at 03:43 UTC
    get_trimmed_text (like get_text) returns text, as documented, not markup. Try sticking with get_tag.
      Changing to get_tag yields...

      Found Data: ARRAY(0x81e326c)

        Instead of this:
        print "Found Data: $text\n";
        try this:
        use Data::Dumper qw/Dumper/;
        ...
        print Dumper( $text );
        As explained in the manual for HTML::TokeParser, the "get_tag" function returns a reference to an array, and Dumper is just an easy way to see what the array elements contain.
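To make the array layout concrete, here is a small sketch of unpacking what get_tag hands back for a start tag, which the HTML::TokeParser documentation describes as [ $tag, $attr, $attrseq, $text ]. The helper name first_tag and the one-line sample string are assumptions made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the thread): returns the array reference
# that get_tag produces for the first occurrence of the named tag.
sub first_tag {
    my ( $html, $name ) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse input\n";
    return $p->get_tag($name);
}

my $tag = first_tag( '<CDATA><form method=post>...</form></CDATA>', 'form' );

# Unpack the four elements of a start-tag result.
my ( $name, $attr, $attrseq, $text ) = @$tag;

print "Tag name:  $name\n";              # lower-cased tag name
print "Method:    $attr->{method}\n";    # attribute values, keyed by name
print "Attr list: @$attrseq\n";          # attribute names in source order
print "As typed:  $text\n";              # the tag as it appears in the source
```

Interpolating the reference itself into a string is what produces output such as ARRAY(0x81e326c); dereferencing it as above (or passing it to Dumper) shows the contents instead.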