PerlMonks  

Re^3: parsing html

by wfsp (Abbot)
on May 14, 2009 at 17:06 UTC


in reply to Re^2: parsing html
in thread parsing html

Nearly! :-)
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $html = do { local $/; <DATA> };   # slurp everything after __DATA__

my $p = HTML::TreeBuilder->new;
$p->parse_content($html);             # parse_content if you have a string

my @tds = $p->look_down(_tag => q{td});         # get a list of all the td tags
for my $td (@tds) {
    my $bold = $td->look_down(_tag => q{b});    # look for a bold tag
    if ($bold) {
        print $bold->as_text, qq{\n};           # if there is one print the text
    }
}
$p->delete;                           # when you've finished with it

quite SOLVED Re^4: parsing html
by paola82 (Sexton) on May 15, 2009 at 09:10 UTC

    Thanks...I read it just now :-) and tried this

    #!/usr/local/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my @files = (
        ["http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253", "a.txt"],
    );
    for my $duplet (@files) {
        mirror($duplet->[0], $duplet->[1]);
    }

    open DATA, 'a.txt';
    my $html = do { local $/; <DATA> };

    my $p = HTML::TreeBuilder->new;
    $p->parse_content($html);             # parse_content if you have a string

    my @tds = $p->look_down(_tag => q{td});         # get a list of all the td tags
    for my $td (@tds) {
        my $bold = $td->look_down(_tag => q{b});    # look for a bold tag
        if ($bold) {
            print $bold->as_text, qq{\n};           # if there is one print the text
        }
    }
    $p->delete;                           # when you've finished with it

    So I have two last questions for the monks... for today :-)

    1) Do I have to download the content of the web page in order to work with the DATA filehandle? This is the only way I have found to make it work.

    2) How can I refine my script so that it prints only the data I need?

    Thank you all, you are essential for the Perl community and for my bioinformatics work. Thanks!

      Your earlier post included something like:
      my $url3 = "http://microrna.sanger.ac.uk/blah/blah";
      my $content = get $url3;
      This gives you a string in $content that you can pass straight to $p->parse_content($content).

      I only used the special Perl <DATA> filehandle for the purposes of the example (so I could easily get a string of HTML). You won't need it, because LWP::Simple's get already gives you the page as a string.

      You need to use the regex on the text, so something like this might do it (untested):

      for my $td (@tds) {
          my $bold = $td->look_down(_tag => q{b});    # look for a bold tag
          next unless $bold;
          my $txt = $bold->as_text;
          if ($txt =~ m/miR|let/) {
              print $txt, qq{\n};                     # print only the text that matches
          }
      }
      Hope that helps
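
Putting the two suggestions above together (parse a string, then filter the bold text with the regex), a minimal self-contained sketch might look like the following. The sample markup and the helper name matching_bold_text are assumptions made up for illustration; the real page's structure may differ, and in the actual script the string would come from LWP::Simple's get rather than a here-doc.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page.
my $content = <<'HTML';
<html><body><table>
<tr><td><b>hsa-miR-21</b></td><td>plain cell</td></tr>
<tr><td><b>Score</b></td><td><b>hsa-let-7a</b></td></tr>
</table></body></html>
HTML

# Hypothetical helper: for every td, take its first <b> element (if any)
# and keep the text only when it matches the given pattern.
sub matching_bold_text {
    my ($html, $pattern) = @_;
    my $p = HTML::TreeBuilder->new;
    $p->parse_content($html);                       # we have a string, so parse_content
    my @found;
    for my $td ($p->look_down(_tag => q{td})) {
        my $bold = $td->look_down(_tag => q{b});    # scalar context: first match or undef
        next unless $bold;
        my $txt = $bold->as_text;
        push @found, $txt if $txt =~ $pattern;
    }
    $p->delete;                                     # free the tree when finished
    return @found;
}

my @names = matching_bold_text($content, qr/miR|let/);
print "$_\n" for @names;
```

Returning the matches as a list, rather than printing inside the loop, keeps the filtering separate from the output, so the same helper could later feed a file or another data structure.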

        If I understand correctly, I can do something like this to parse the web page without downloading it to a file first...

        #!/usr/local/bin/perl
        use warnings;
        use strict;
        use LWP::Simple;

        my $url = "http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253";
        my $content = get($url);

        use HTML::TreeBuilder;
        my $p = HTML::TreeBuilder->new;
        $p->parse_content($content);          # parse_content if you have a string

        my @tds = $p->look_down(_tag => q{td});         # get a list of all the td tags
        for my $td (@tds) {
            my $bold = $td->look_down(_tag => q{b});    # look for a bold tag
            if ($bold) {
                print $bold->as_text, qq{\n};           # if there is one print the text
            }
        }
        $p->delete;                           # when you've finished with it

        But I don't understand why it gives me nothing back. It seems as if the content of the page has no bold text, which is impossible: I can see it, and if I download the page as before and then do the parsing, it works. Could you explain why? :-( Thanks so much.
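
The thread does not settle why the script prints nothing. One thing that may be worth ruling out is that the fetch itself failed: LWP::Simple's get returns undef when it cannot retrieve the page, and an undefined $content would leave the tree with no td or b elements to find. The sketch below only illustrates that check. The helper name bold_text_or_warn and the sample markup are assumptions, and a failed fetch is one possible cause, not a confirmed diagnosis.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: refuse to parse when there is no content, so an
# empty result cannot be mistaken for "the page has no bold text".
sub bold_text_or_warn {
    my ($content) = @_;
    unless (defined $content && length $content) {
        warn "no content to parse: the fetch may have failed\n";
        return;                                     # empty list
    }
    my $p = HTML::TreeBuilder->new;
    $p->parse_content($content);
    my @found;
    for my $td ($p->look_down(_tag => q{td})) {
        my $bold = $td->look_down(_tag => q{b});    # first <b> inside this td, if any
        push @found, $bold->as_text if $bold;
    }
    $p->delete;
    return @found;
}

# In the real script $content would be the return value of get($url).
# Here undef stands in for a failed fetch, and a made-up string for a good one.
my @failed = bold_text_or_warn(undef);
my @ok     = bold_text_or_warn('<table><tr><td><b>hsa-miR-21</b></td></tr></table>');
print "$_\n" for @ok;
```

If the warning fires with the real URL, the problem lies in the request rather than in the parsing, and the HTML::TreeBuilder part of the script can be left alone.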
