Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Why is this link parser only working the first time?

by dave_aiello (Pilgrim)
on Jan 19, 2001 at 06:03 UTC ( [id://52910]=perlquestion: print w/replies, xml ) Need Help??

dave_aiello has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a link parser that takes a URL and returns an array of links on the page that are within in the /attachments sub tree of a Web server's document tree. The subroutine parse_page works fine if it is only called one time. Subsequent invocations of the subroutine within the same program return nothing in the array, although I am sure that there are links that meet the criteria.

What I want to do is set up a main subroutine which grabs URLs from a database table, passes them one at a time to parse_page, and receives back a corresponding array of links. And to be perfectly clear, I want to do this serially.

I'm sorry if I have done something stupid. I have been sick all week, and I am working on less than my normal powers of abstract reasoning and concentration. I shamelessly leveraged the example code in the perldoc for HTML::LinkExtor and squinted at the screen for a while to get this far.

Dave Aiello
Chatham Township Data Corporation

#!/usr/local/bin/perl use LWP::UserAgent; use HTML::LinkExtor; use URI::URL; sub parse_page { my($url) = @_; my $ua = LWP::UserAgent->new; my @links = (); sub attachment_link_extractor { my ($tag, %attr) = @_; push(@links, values %attr) if (($tag eq 'a') && ($attr{href} =~ m/attachments/)); } my $p = HTML::LinkExtor->new(\&attachment_link_extractor); $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])}); my $base = $res->base; @links = map { if ($_ =~ /^http/) { $_ = "/". url($_, $base)->rel; } else { $_ = $_; } } @links; $p->links; return(@links); }

Replies are listed 'Best First'.
Re: Why is this link parser only working the first time?
by dkubb (Deacon) on Jan 19, 2001 at 10:34 UTC
    Here's a possible solution:
    #!/usr/bin/perl -w use strict; use LWP::UserAgent; use HTML::LinkExtor; use URI::URL; use HTTP::Request::Common qw(GET); my $links = parse_page('http://www.perl.com/'); sub parse_page { my $url = shift; #Get the base URL my $base = url($url)->abs->base; my @links; #Push any matching links onto the @link array #Only put a relative link, so we don't store #excessive amounts of data my $callback = sub { my $tag = shift; my %attr = @_; if($tag eq 'a' && $attr{href}[0] =~ $base) { push @links, url($attr{href}[0], $base)->abs->rel; } }; #Prepare the Link parser my $p = HTML::LinkExtor->new($callback, $base); #Fetch and Parse the web page my $ua = LWP::UserAgent->new; my $response = $ua->request(GET($url), sub {$p->parse($_[0])}); return \@links; }

    I'll explaination of what I did:

    I tried to limit the amount of data that is stored into the @links array. So instead of holding the entire url in the array, then shortening it later, I just made it relative inside the callback subroutine.

    Passing in the second argument to HTML::LinkExtor said to it "Add in this base url to any relative links automatically". This simplified the callback code, because every href link passed in was absolute, I didn't have to do any error checking similar to the map you used at the end of the original subroutine.

    Notice how I am accessing the first element inside $attr{href}? It's an arrayref. I think this is where you may have had problems, I know I did too, at first. But liberal use of Data::Dumper showed me the error in my ways.

    In order to simplify the LWP client block, I removed HTTP::Request. Instead I used a module called HTTP::Request::Common, which generally does the Right Thing, saving a few keystokes along the way.

    And finally, I returned an array reference instead of a real array. When you return an array from a subroutine, you are literally copying over each element to a new place in memory. Not to mention, that when @links goes out of scope, perl's garbage collector will be invoked, and could cause a performance penalty. By passing an array reference, I have just passed the location of the @links array back to the caller, not the actual information. It's much lighter than the entire @array, and quite speedy to pass around. Your milage may vary, but I believe it's a good habit to get into.

Re (tilly) 1: Why is this link parser only working the first time?
by tilly (Archbishop) on Jan 19, 2001 at 06:29 UTC
    If you had turned warnings on you would have receieved a message about "cannot stay shared". I tried to explain the error in RE (3): BrainPain-Help, I never got any feedback on it so I have no idea whether it is or isn't understandable.

    Anyways to fix your problem is simplicity, instead of writing:

    sub attachment_link_extractor { # ...
    write that as
    my $link_extractor = sub { # ...
    And then just pass the variable in to the constructor to LinkExtor.
      tilly:

      Thanks very much. This is an excellent solution because the code change is limited to a single line.

      What used to be:

      sub attachment_link_extractor { my ($tag, %attr) = @_; push(@links, values %attr) if (($tag eq 'a') && ($attr{href} =~ m/attachments/)); } my $p = HTML::LinkExtor->new(\&attachment_link_extractor);
      ... now becomes ...
      $attachment_link_extractor = sub { my ($tag, %attr) = @_; push(@links, values %attr) if (($tag eq 'a') && ($attr{href} =~ m/attachments/)); }; my $p = HTML::LinkExtor->new($attachment_link_extractor);
      The only part of the your suggested change that confused me was the need to put a semicolon after the bracket that closes the subroutine. I don't remember ever writing Perl that way before, but, I keep forgetting that the subroutine itself is part of an assignment statement.

      Dave Aiello
      Chatham Township Data Corporation

        Sorry, I should have remembered the semi-colon. That is the kind of detail I stopped thinking about, they just automatically go in when I write, and if I edit then run I expect to have Perl catch.

        You are right about why you need it. In the language used in perlsyn, the subroutine is now within a "simple statement" and you need to terminate the statement.

        As for writing Perl like that, I do it all of the time and recommend it whenever I get the chance. :-)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://52910]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2024-04-19 17:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found