PerlMonks  

Saving a Pattern Match from Subroutine

by shoness (Friar)
on Jul 23, 2007 at 13:18 UTC

shoness has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

You can probably help me even if you don't know details of HTML::TreeBuilder or of my application. My question is more general Perl, but framed in my own world. As usual, I can solve this via a very brute force method, but "there has to be a better way".

I'm using HTML::TreeBuilder to grab the filename of an anchor that is an immediate child of a known tag. The structures look like:

... <span class=inst> <a href="file23.html#some_tag">...</a> </span> ...
I want to return a list that contains all the files from key sections like this. For example, the section above should simply "push(@list, "file23.html");"

The snippet below will find the span containing the anchor. I really don't care about that. I just want to push "$1" from the qr// search in the subroutine call into the list.

foreach my $inst (@instances) {
    my @junk = $inst->look_down(
        '_tag'  => 'span',    # HTML::Element keeps the tag name under '_tag'
        'class' => 'inst',
        sub {
            $_[0]->look_down(
                '_tag' => 'a',
                'href' => qr/(\w+\.html)#\w/);
            #                 ^^^^^^^^^
        });
}
My brute force method is to find all the anchors that fit my pattern and look at their immediate parent to see if they have the "span" tag that interests me, pulling the filename out if it does.

Oh, and the subroutine itself must continue to return a "1" or the look_down() will stop at that point and I won't get all the files.
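
For reference, the brute-force approach described above might look roughly like the sketch below. The sample HTML, the variable names, and the use of new_from_content() to build the tree are illustrative assumptions, not part of the original application.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, made up for this sketch.
my $html = <<'HTML';
<html><body>
<span class="inst"> <a href="file23.html#some_tag">aaa</a> </span>
<span class="other"> <a href="file24.html#some_tag">bbb</a> </span>
<a href="file25.html#some_tag">bare link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @files;

# List context: collect every anchor whose href fits the pattern.
for my $anchor ( $tree->look_down( '_tag' => 'a', 'href' => qr/\w+\.html#\w/ ) ) {

    # Keep only anchors whose immediate parent is <span class="inst">.
    my $parent = $anchor->parent or next;
    next unless $parent->tag eq 'span';
    next unless ( $parent->attr('class') // '' ) eq 'inst';

    # Match again here so the capture happens in this scope.
    push @files, $1 if $anchor->attr('href') =~ /(\w+\.html)#\w/;
}

$tree->delete;    # release the tree once we are done with it

print "$_\n" for @files;
```

It works, but it visits every matching anchor in the document and then throws most of them away, which is why it feels like brute force.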

Thanks for your help!

Re: Saving a Pattern Match from Subroutine
by Corion (Patriarch) on Jul 23, 2007 at 13:32 UTC

    I really like using Web::Scraper for that, or rather, its method of using HTML::TreeBuilder::XPath and HTML::Selector::XPath to specify and extract tags:

    use strict;
    use Web::Scraper;
    use Data::Dumper;

    my $data = do { local $/; <DATA> };

    # Weirdo syntax of Web::Scraper
    my $link = scraper {
        process 'a', href => '@href', text => 'TEXT';
        result 'href', 'text';
    };
    my $scraper = scraper {
        process 'span.inst a', 'links[]' => $link;
        result 'links[]';
    };

    print Dumper $scraper->scrape($data);

    __DATA__
    <html>
    <body>
    ...
    <span class=inst> <a href="file23.html#some_tag">aaa</a> </span>
    <span class=inst> <a href="file24.html#some_tag">bbb</a> </span>
    <span class=no_inst> <a href="file24.html#some_tag">(should not match either due to wrong span class)</a> </span>
    <a href="file23.html#some_tag">a bare link (should not match)</a>
    </body>
    </html>

    Getting the syntax of Web::Scraper right isn't always straightforward (to me at least), but I hope that some better, non-code based, configurability will come soon.

Re: Saving a Pattern Match from Subroutine
by Ovid (Cardinal) on Jul 23, 2007 at 13:37 UTC

    Without knowing too much about the problem, it looks like you want something like this:

    my @files;
    foreach my $inst (@instances) {
        my @junk = $inst->look_down(
            '_tag'  => 'span',    # HTML::Element keeps the tag name under '_tag'
            'class' => 'inst',
            sub {
                my $result = $_[0]->look_down(
                    '_tag' => 'a',
                    'href' => qr/(\w+\.html)#\w/);
                push @files => $1 if $1;
                return $result;
            });
    }

    Of course, this falls into the "wild-assed guess" category. It seems straightforward enough that I'm wondering if I've misunderstood something.

    Cheers,
    Ovid

    New address of my CGI Course.

      Ack! Thanks Ovid!
      push @files => $1 if $1;
      Thanks to Corion as well for the Web::Scraper pointer.
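
One caveat on reading $1 after look_down() returns: Perl's capture variables are dynamically scoped to the enclosing block, so a match performed inside HTML::Element's own code is generally not visible in the calling sub. A more defensive sketch is to match each anchor's href again in your own scope. The sample HTML and the variable names below are illustrative assumptions, not taken from the original application.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, modelled on the markup discussed in this thread.
my $html = <<'HTML';
<html><body>
<span class="inst"> <a href="file23.html#some_tag">aaa</a> </span>
<span class="inst"> <a href="file24.html#some_tag">bbb</a> </span>
<span class="no_inst"> <a href="file24.html#some_tag">wrong span class</a> </span>
<a href="file23.html#some_tag">a bare link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @files;

# List context matters: in scalar context look_down() returns only the
# first matching element.
my @spans = $tree->look_down(
    '_tag'  => 'span',
    'class' => 'inst',
    sub {
        my ($span) = @_;

        # Every anchor below this span whose href looks like "name.html#...".
        for my $anchor ( $span->look_down( '_tag' => 'a', 'href' => qr/\.html#/ ) ) {

            # Capture in this scope, where $1 is reliably ours.
            push @files, $1 if $anchor->attr('href') =~ /(\w+\.html)#\w/;
        }
        return 1;    # a true value keeps this span as a match
    },
);

$tree->delete;

print "$_\n" for @files;
```

Collecting into @files from inside the callback keeps the single pass over the tree that the original question was after, while the explicit match avoids depending on where $1 happens to be visible.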
