http://qs321.pair.com?node_id=717659


in reply to HTML::TokeParser Frustration

I'm not sure what your ultimate goal is, but if you want to print out just the content of a "cdata" tag, with all other tags retained within that section of data, maybe something like this will do:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $sample_HTML = <<EOD;
<HTML>
blah.
<CDATA>
Just some random whatever.
It might have some <b>real</b> HTML like a table or CSS styling
or even some <H1>IMPORTANT</H1> words.
Maybe even a form <form method=post>...</form>
</CDATA>
</HTML>
EOD

my $p = HTML::TokeParser->new( \$sample_HTML );
my $in_cdata = 0;    # greater than zero while inside <cdata>...</cdata>

while ( my $token = $p->get_token ) {
    my ( $tkn_type, $tkn_content, @rest ) = @$token;
    my $is_tag = ( $tkn_type eq 'S' or $tkn_type eq 'E' );
    if ($is_tag) {
        $tkn_content = pop @rest;    # last array element is full tag string
    }

    # Only real start/end tags count as cdata markers, matched without
    # regard to case; plain text that mentions "cdata" is left alone.
    my $is_cdata_tag = ( $is_tag and $tkn_content =~ m{^</?cdata\b}i );

    print $tkn_content if ( $in_cdata and not $is_cdata_tag );
    if ($is_cdata_tag) {
        $in_cdata += ( $tkn_type eq 'S' ) ? 1 : -1;
    }
}
That doesn't print the cdata tags themselves, but it prints everything between them, including other tags. To do that, the main loop has to process every "token" (each tag and each run of intervening text in the whole document) one at a time, and a state variable has to keep track of whether the current token lies inside a cdata section.
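If it helps to see the state-variable idea on its own, here is a minimal sketch that runs without HTML::TokeParser installed. It assumes a hand-written token list shaped like the arrays get_token returns (start tags as ["S", $tag, $attr, $attrseq, $text], end tags as ["E", $tag, $text], text as ["T", $text, $is_data]). The helper name inner_text and the sample tokens are made up for illustration; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the original text of every token found
# between <$wanted> and </$wanted>, leaving out the wrapper tags.
sub inner_text {
    my ( $wanted, @tokens ) = @_;
    my ( $depth, $out ) = ( 0, '' );
    for my $token (@tokens) {
        my ( $type, $name ) = @$token;
        my $is_tag = ( $type eq 'S' or $type eq 'E' );

        # Start and end tags of the wanted element only move the counter.
        if ( $is_tag and lc $name eq $wanted ) {
            $depth += ( $type eq 'S' ) ? 1 : -1;
            $depth = 0 if $depth < 0;    # ignore a stray end tag
            next;
        }
        next unless $depth;

        # S and E tokens keep their original tag text in the last element;
        # for T tokens, element 1 is the text itself.
        $out .= $is_tag ? $token->[-1] : $name;
    }
    return $out;
}

# Assumed sample data, written by hand to mimic get_token output.
my @tokens = (
    [ 'T', 'blah. ', '' ],
    [ 'S', 'cdata', {}, [], '<CDATA>' ],
    [ 'T', 'some ', '' ],
    [ 'S', 'b', {}, [], '<b>' ],
    [ 'T', 'real', '' ],
    [ 'E', 'b', '</b>' ],
    [ 'T', ' cdata text', '' ],
    [ 'E', 'cdata', '</CDATA>' ],
    [ 'T', ' trailing', '' ],
);

my $result = inner_text( 'cdata', @tokens );
print "$result\n";
```

Comparing the tag name instead of pattern-matching the whole token means ordinary text that happens to contain the word "cdata" does not disturb the counter.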