quite SOLVED Re^4: parsing html

Thanks... I read it just now :-) and tried this:

#!/usr/local/bin/perl

use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
my @files = (["http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253", "a.txt"],);
for my $duplet (@files) {
    mirror($duplet->[0], $duplet->[1]);
};
open DATA, 'a.txt';
my $html = do{local $/;<DATA>};
my $p = HTML::TreeBuilder->new;
$p->parse_content($html); # parse_content if you have a string

my @tds = $p->look_down(_tag => q{td}); # get a list of all the td tags
for my $td (@tds){
  my $bold = $td->look_down(_tag => q{b}); # look for a bold tag
  if ($bold){
    print $bold->as_text, qq{\n}; # if there is one print the text
  }
}
$p->delete; # when you've finished with it

It prints:

Gene Name
Gene Name
Transcript
Gene
Description
Alignment View
Hit infomation
mmu-miR-705
mmu-miR-705
hsa-let-7d
hsa-let-7e
hsa-miR-483-5p
mmu-miR-683
hsa-miR-650
hsa-miR-920
mmu-miR-709
hsa-miR-26b*
hsa-miR-185
hsa-let-7a
hsa-miR-765
hsa-miR-629*
hsa-miR-19b-2*
hsa-miR-31
mmu-miR-707
hsa-miR-665
hsa-miR-339-5p
hsa-let-7c
hsa-let-7b
hsa-miR-7
hsa-miR-26b*
hsa-let-7g
hsa-miR-382
hsa-miR-454*
hsa-miR-501-5p
mmu-miR-666-5p
hsa-miR-486-3p
hsa-let-7f
mmu-miR-680
hsa-miR-219-2-3p
hsa-miR-153
hsa-miR-26a-2*
hsa-miR-328
hsa-miR-220c
hsa-miR-19a*
hsa-miR-433
hsa-miR-769-5p
hsa-miR-26b*
hsa-miR-19a*
hsa-miR-19b-1*
hsa-miR-25*
hsa-miR-483-5p
mmu-miR-685
hsa-miR-938
mmu-miR-465a-3p
hsa-miR-139-3p
hsa-miR-187*
mmu-miR-687
Features

It's wonderful! But now I want to refine it: I only want to print the strings containing miR or let, so not Features, Gene, etc. I tried to use a regular expression:

#!/usr/local/bin/perl

use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
my @files = (["http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253", "a.txt"],);
for my $duplet (@files) {
    mirror($duplet->[0], $duplet->[1]);
};
open DATA, 'a.txt';
my $html = do{local $/;<DATA>};
my $p = HTML::TreeBuilder->new;
$p->parse_content($html); # parse_content if you have a string

my @tds = $p->look_down(_tag => q{td}); # get a list of all the td tags
for my $td (@tds){
  my $bold = $td->look_down(_tag => q{b}); # look for a bold tag
  if ($bold=~ m/miR/ || $bold=~ m/let/){
    print $bold->as_text, qq{\n}; # if there is one print the text
  }
}
$p->delete; # when you've finished with it

But it gives me the error message "Use of uninitialized value in pattern match (m//) at test.pl line 19, <DATA> line 1."

So I have two last questions for the monks... for today :-)

1) Do I have to download the content of the web page to a file and work with the DATA filehandle? This is the only way I have found to make it work.
2) How can I refine my script so that it prints only the data I need?

Thank you all, you are essential for the Perl community and for my bioinformatics work. Thanks!

Re: quite SOLVED Re^4: parsing html by wfsp (Abbot) on May 15, 2009 at 09:28 UTC
Your earlier post included something like:

my $url3="http://microrna.sanger.ac.uk/blah/blah";
my $content=get $url3;

This gives you a string in `$content` that you can supply to `$p->parse_content($content);`. I only used the special Perl `<DATA>` file handle for the purposes of the example (so I could easily get a string of HTML). You won't need to do this, as that is what `LWP::Simple`'s `get` gives you.

You need to use the regex on the text, so something like this might do it (untested):

for my $td (@tds){
  my $bold = $td->look_down(_tag => q{b}); # look for a bold tag
  next unless $bold;
  my $txt = $bold->as_text;
  if ($txt =~ m/miR|let/){
    print $txt, qq{\n}; # print the text only if it matches
  }
}

Hope that helps
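To make the point about matching on the text concrete, here is a minimal sketch that runs without the web page or HTML::TreeBuilder. `My::Node` and `mirna_labels` are hypothetical stand-ins invented for this illustration; only the `as_text` method name mirrors the real element objects. It assumes an alternation between miR and let is intended, so the `|` in the pattern is left unescaped.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an element object (not part of HTML::TreeBuilder).
package My::Node;
sub new     { my ($class, $text) = @_; return bless { text => $text }, $class }
sub as_text { return $_[0]{text} }

package main;

# Return the text of every node whose text mentions miR or let.
# Undefined entries (cells without a <b> tag) are dropped first, which
# avoids the "uninitialized value in pattern match" warning.
sub mirna_labels {
    my @nodes = @_;
    return grep { /miR|let/ } map { $_->as_text } grep { defined $_ } @nodes;
}

my @found = (
    My::Node->new('Gene Name'),
    undef,                        # a cell with no bold tag
    My::Node->new('mmu-miR-705'),
    My::Node->new('hsa-let-7d'),
);
print "$_\n" for mirna_labels(@found);
```

Matching against the object itself rather than its text would test a string such as `My::Node=HASH(0x...)`, which is why the pattern has to be applied to the result of `as_text`. An escaped pipe (`\|`) would match only a literal `|` character and so would filter out every label.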
Re^2: quite SOLVED Re^4: parsing html by paola82 (Sexton) on May 15, 2009 at 10:19 UTC
If I understand correctly, I can do something like this to parse the web page without downloading it to a file:

#!/usr/local/bin/perl
use warnings;
use strict;
use LWP::Simple;
my $url="http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253";
my $content=get ($url);
use HTML::TreeBuilder;
my $p = HTML::TreeBuilder->new;
$p->parse_content($content); # parse_content if you have a string

my @tds = $p->look_down(_tag => q{td}); # get a list of all the td tags
for my $td (@tds){
  my $bold = $td->look_down(_tag => q{b}); # look for a bold tag
  if ($bold){
    print $bold->as_text, qq{\n}; # if there is one print the text
  }
}
$p->delete; # when you've finished with it

But I don't understand why it gives me nothing back. It seems as if the content of the page has no bold strings, which is impossible: I can see them, and if I download the page like before and then do the parsing, it works. Could you explain why? :-( Thanks very much.
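One thing worth ruling out here: `LWP::Simple`'s `get` returns `undef` when the request fails, and the script passes the result straight to the parser without checking it. The sketch below shows one way to guard against that. `fetch_or_die` is a hypothetical helper, not part of LWP, and it takes the fetching routine as a code reference so that it can be tried without network access. Whether a failed fetch is actually the cause of the empty output in this case is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): call a fetcher and refuse to
# continue if it returns undef, which is how LWP::Simple::get signals failure.
sub fetch_or_die {
    my ($url, $fetcher) = @_;
    my $content = $fetcher->($url);
    die "Could not fetch $url\n" unless defined $content;
    return $content;
}

# Offline demonstration: a stub fetcher stands in for the real request.
my $html = fetch_or_die('http://example.invalid/', sub { '<td><b>hsa-let-7d</b></td>' });
print length($html), " characters fetched\n";
```

With `LWP::Simple` loaded, the call would look like `fetch_or_die($url, \&LWP::Simple::get)`. This only detects the failure; it does not say why the request failed. For that, the `status_line` of an `LWP::UserAgent` response is more informative.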
Re^3: quite SOLVED Re^4: parsing html by wfsp (Abbot) on May 15, 2009 at 11:00 UTC
...nothing...

Same here. :-( I had more luck with LWP::UserAgent though:

#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = q{http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253};
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
my $response = $ua->get($url);
my $content;
if ($response->is_success) {
  $content = $response->content;
}
else {
  die $response->status_line;
}

my $p = HTML::TreeBuilder->new;
$p->parse_content($content);

my @tds = $p->look_down(_tag => q{td});
for my $td (@tds){
  my $bold = $td->look_down(_tag => q{b});
  next unless $bold;
  my $txt = $bold->as_text;
  if ($txt =~ /miR|let/){
    print $txt, qq{\n};
  }
}
$p->delete;

It prints:

mmu-miR-705
mmu-miR-705
hsa-let-7d
hsa-let-7e
hsa-miR-483-5p
mmu-miR-683
hsa-miR-650
hsa-miR-920
mmu-miR-709
hsa-miR-26b*
hsa-miR-185
hsa-let-7a
hsa-miR-765
hsa-miR-629*
hsa-miR-19b-2*
hsa-miR-31
mmu-miR-707
hsa-miR-665
hsa-miR-339-5p
hsa-let-7c
hsa-let-7b
hsa-miR-7
hsa-miR-26b*
hsa-let-7g
hsa-miR-382
hsa-miR-454*
hsa-miR-501-5p
mmu-miR-666-5p
hsa-miR-486-3p
hsa-let-7f
mmu-miR-680
hsa-miR-219-2-3p
hsa-miR-153
hsa-miR-26a-2*
hsa-miR-328
hsa-miR-220c
hsa-miR-19a*
hsa-miR-433
hsa-miR-769-5p
hsa-miR-26b*
hsa-miR-19a*
hsa-miR-19b-1*
hsa-miR-25*
hsa-miR-483-5p
mmu-miR-685
hsa-miR-938
mmu-miR-465a-3p
hsa-miR-139-3p
hsa-miR-187*
mmu-miR-687

I'm no expert on LWP. Perhaps it is the timeout? It took a while to download.
Re^4: quite SOLVED Re^4: parsing html by paola82 (Sexton) on May 15, 2009 at 11:26 UTC

