PerlMonks  

Re: Ignoring specific html tags before parsing

by Anonymous Monk
on Oct 07, 2013 at 06:19 UTC


in reply to Ignoring specific html tags before parsing

but could not find a way to specifically ignore certain tags. I

Well, XPath lets you select only what you want, so ignoring something is simple: don't select it to begin with.

Re^2: Ignoring specific html tags before parsing
by ganeshPerlStarter (Novice) on Oct 07, 2013 at 06:55 UTC
    >>its simple to do, simply don't select it to begin with
    Then in that case, we need to list ALL the tags we're interested in. Won't this end up as a long list? HTML::Parser has a method ignore_tags() which can be used to ignore tags. I used it as below and tried to get the text, but it returned many nested arrays. I could not figure out how to access the final extracted text from this "@array".
    my @array;
    my $p = HTML::Parser->new(api_version => 3,
                              handlers => { text => [\@array, "text"]});
    $p->ignore_tags(qw(table img));
    $p->parse($page);
    print "Size of array=$#array\n";
    foreach my $aline (@array) {
        print $aline;
    }
    print "\n";
    Meanwhile, I found an alternative, but it seems to be quite a bit slower than what we could have achieved with HTML::Parser.
    use LWP::Simple;        # assumed source of get()
    use HTML::TokeParser;

    my $link = 'somelinek';
    my $page = get($link) or die "Could not fetch $link";
    my $stream = HTML::TokeParser->new(\$page);
    my $doparse = 1;    ## 0 means don't parse
    while (my $token = $stream->get_token) {
        if ($token->[0] eq 'S') {
            if ($token->[1] eq 'table') {
                $doparse = 0;
            }
            elsif ($token->[1] eq 'img') {
                ;;
            }
        }
        elsif ($token->[0] eq 'E' and $token->[1] eq 'table') {
            $doparse = 1;
        }
        elsif ($token->[0] eq 'C') {
            ;;
        }
        elsif ($token->[0] eq 'T' and $doparse == 1) {
            # text: process the text in $token->[1]
            # skip: empty lines, "&nbsp;"
            if (defined ($token->[1])) {
                $token->[1] =~ s/&nbsp;/ /ig;
                $token->[1] =~ s/’/'/ig;
                $token->[1] =~ s/&#14[7-8];/"/ig;
                $token->[1] =~ s/—//ig;
                $token->[1] =~ s/&amp;/&/ig;
                $token->[1] =~ s/-{2,}//ig;
                print "$token->[1]";
            }
        }
    }
    This use of TokeParser gives a lot of broken text. What would be a better way? Thanks.
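On the "nested arrays" from HTML::Parser: when a handler is given as an array reference, the parser pushes one array reference per event onto it, holding the argspec values, so the text sits in element 0 of each entry. Also note that ignore_tags() only suppresses the start and end tag events themselves; to drop an element together with its content, ignore_elements() is the method to reach for. Below is a minimal sketch under those assumptions. The sample $page is made up for illustration, and "dtext" is used instead of "text" so that entities arrive decoded.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample standing in for the fetched page.
my $page = '<p>Hello</p>'
         . '<table><tr><td>skip me</td></tr></table>'
         . '<p>World <img src="x.png">!</p>';

my @events;
my $p = HTML::Parser->new(
    api_version => 3,
    # Each text event pushes an array ref [ $decoded_text ] onto @events.
    handlers    => { text => [ \@events, 'dtext' ] },
);
$p->ignore_elements(qw(table));   # drop <table>...</table> including content
$p->ignore_tags(qw(img));         # drop only the <img> tag events
$p->parse($page);
$p->eof;

# Unwrap the per-event array refs to get the plain text.
my $text = join '', map { $_->[0] } @events;
print "Number of text events: ", scalar(@events), "\n";
print "$text\n";
```

Printing scalar(@events) gives the element count; $#array in the code above is the last index, which is one less.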
      What is your actual goal?
        >>What is your actual goal?
        I want to ignore table and img tags when parsing HTML files and extracting the embedded text.
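For that goal, the TokeParser loop can be tightened in two ways that may account for some of the broken text: a single on/off flag is switched back on by the first closing </table> even when tables are nested, and text tokens from get_token come back with entities still encoded. The following is only a sketch, assuming a depth counter for nesting and decode_entities() from HTML::Entities in place of the hand-written substitutions; the sample $page is invented for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical sample with a nested table and an entity.
my $page = '<p>Fish &amp; chips</p>'
         . '<table><tr><td>outer<table><tr><td>inner</td></tr></table>'
         . 'still skipped</td></tr></table>'
         . '<p>Done <img src="x.png"></p>';

my $stream = HTML::TokeParser->new(\$page) or die "Cannot create parser";
my $depth  = 0;    # number of <table> elements currently open
my $text   = '';

while (my $token = $stream->get_token) {
    my ($type, $tag) = @$token;
    if ($type eq 'S' and $tag eq 'table') {
        $depth++;
    }
    elsif ($type eq 'E' and $tag eq 'table') {
        $depth-- if $depth > 0;
    }
    elsif ($type eq 'T' and $depth == 0) {
        # Text tokens are [ 'T', $text, $is_data ]; decode unless literal data.
        $text .= $token->[2] ? $token->[1] : decode_entities($token->[1]);
    }
    # <img> start tags and comments fall through and are ignored.
}
print "$text\n";
```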

      then in that case, we need to list ALL those tags we're interested in.

      Or you could select the ones you want and [id://1052072|remove them from the tree]

        select the ones you don't want and remove them from the tree with delete

        That's it, I'm done for tonight.
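A rough sketch of the "select what you don't want and delete it" idea, assuming plain HTML::TreeBuilder with look_down() rather than whatever the linked node uses (its contents are not shown here). The sample $page is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample standing in for the fetched page.
my $page = '<html><body><p>Hello</p>'
         . '<table><tr><td>skip me</td></tr></table>'
         . '<p>World <img src="x.png">!</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($page);

# Select the unwanted elements and remove each one, children included.
for my $tag (qw(table img)) {
    $_->delete for $tree->look_down(_tag => $tag);
}

# What is left in the tree is only the content we care about.
my $text = $tree->as_text;
print "$text\n";

$tree->delete;    # free the tree when finished
```

This avoids listing every wanted tag: only the two unwanted ones are named, and everything else stays in the tree.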
