Link Extraction when grabbing web page with USER/PASS

by cdherold (Monk)
on Mar 04, 2003 at 04:15 UTC ( [id://240234]=perlquestion )

cdherold has asked for the wisdom of the Perl Monks concerning the following question:

monks,

alas, I am stymied once again ... and have humbly come for assistance.

I am trying to pull the links off a page and store them in @links. There is standard code for this which I have used with success.

my @links = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'a';
    push(@links, values %attr);
}
# Make the parser.
$p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])});
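For reference, here is a minimal offline sketch of how that callback behaves. It feeds a hard-coded HTML string to HTML::LinkExtor instead of fetching anything; the sample markup and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample markup, standing in for a fetched page.
my $html = <<'HTML';
<a href="http://example.com/one">one</a>
<img src="http://example.com/pic.png">
<a href="http://example.com/two">two</a>
HTML

my @links;

# LinkExtor calls this once per link-bearing tag, passing the tag name
# followed by attribute/value pairs for the link attributes only.
sub callback {
    my ($tag, %attr) = @_;
    return if $tag ne 'a';          # skip <img> and other tags
    push @links, values %attr;
}

my $p = HTML::LinkExtor->new(\&callback);
$p->parse($html);
$p->eof;                            # flush anything still buffered

print "$_\n" for @links;
```

The `<img>` tag reaches the callback too, but the `return if $tag ne 'a'` line drops it, so only the two `href` values end up in @links.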

Now, however, I am trying to get the links off a page that requires a username/password ... through the assistance of the monks I have accomplished a user/pass webpage grab...

$ua = LWP::UserAgent->new;
$req = HTTP::Request->new(GET => $url);
$req->authorization_basic('user', 'pass');
$res = $ua->request($req)->as_string,

Now the question is how to merge the user/pass webpage grab with the link extractor.

I have tried

$ua = LWP::UserAgent->new;
$req = HTTP::Request->new(GET => $url);
$req->authorization_basic('user', 'pass');
$res = $ua->request($req)->as_string, sub {$p->parse($_[0])};

but when I print out @links I get nothing. I think (but really have no clue) this has something to do with the ->as_string, but without it the webpage comes out as HTTP::Response=HASH(0x8435960).
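To illustrate the HTTP::Response=HASH(...) output mentioned above: request() returns an object, and printing the object itself shows only its address. The sketch below builds a response by hand (no network; the body is invented) to compare the object, ->as_string, and ->content. Note that ->as_string still contains the HTML, so on its own it is probably not what makes @links come out empty.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand so nothing is fetched; the body is made up.
my $body = qq{<a href="http://example.com/">home</a>\n};
my $res  = HTTP::Response->new(200, 'OK', [ 'Content-Type' => 'text/html' ], $body);

print "object:    $res\n";              # something like HTTP::Response=HASH(0x...)
print "as_string: ", $res->as_string;   # status line + headers + body
print "content:   ", $res->content;     # body only, ready to hand to a parser
```

If the page is fetched whole rather than parsed as it arrives, ->content is the piece to pass to a link parser.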

Is there something else that I should be doing to get these links pulled out properly? Obviously there is, but do you guys know what that might be?

cdherold

Replies are listed 'Best First'.
Re: Link Extraction when grabbing web page with USER/PASS
by tachyon (Chancellor) on Mar 04, 2003 at 04:30 UTC

    Why bother with LinkExtor when you can just:

    use HTML::TokeParser;
    my $parser = HTML::TokeParser->new( \$content );
    my @links;
    while ( my $token = $parser->get_tag(qw( a img )) ) {
        my $link = $token->[1]{href} || $token->[1]{src} || next;
        push @links, $link;
    }

    You will need to convert relative links to absolute if that is what you need. See Link Checker for more code.
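One possible way to do that conversion is URI->new_abs from the URI module, sketched below. The base URL and the sample links are hypothetical; in practice the base would be the address the page was fetched from.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL: the address the page was fetched from.
my $base  = 'http://www.example.com/docs/index.html';

# Made-up links of the kinds a parser might return.
my @links = ('intro.html', '../img/logo.png', '/about', 'http://other.example.org/x');

# Resolve each link against the base; already-absolute links pass through.
my @absolute = map { URI->new_abs($_, $base)->as_string } @links;

print "$_\n" for @absolute;
```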

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      ok, so you could use either of those, but the problem is why can't i get anything out with either one? is it because my web page is grabbed as a string? if so how do i change that so that i can extract links?

        Eh? Get page as string, stick in $content.

        my $content = <<HTML;
        <a href="http://what.the.com">hello?</a>
        <a href="http://is.dis.org">hello?</a>
        <a href="http://your.net">hello?</a>
        <a href="http://problem">hello?</a>
        HTML

        use HTML::TokeParser;
        my $parser = HTML::TokeParser->new( \$content );
        my @links;
        while ( my $token = $parser->get_tag(qw( a img )) ) {
            my $link = $token->[1]{href} || $token->[1]{src} || next;
            push @links, $link;
        }
        print "@links";

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Link Extraction when grabbing web page with USER/PASS
by zakb (Pilgrim) on Mar 04, 2003 at 09:08 UTC

    To get your idea working, you need to look at these lines of code:

    # (1) your working example
    $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])});
    # compared with (2)
    $res = $ua->request($req)->as_string, sub {$p->parse($_[0])};

    Looking carefully at the bracketing in the second line, it appears it should be more like:

    $res = $ua->request($req, sub {$p->parse($_[0])});

    Your version never passed the LinkExtor callback to the UserAgent's request method: the sub sat outside the parentheses, after ->as_string, so request never saw it. The call signature for the request method in the form you want is:

    $response = $ua->request($request, \&callback);

    where $request is an HTTP::Request object and \&callback is a reference to a sub (named or anonymous) that receives each chunk of content as it arrives.
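Putting the pieces of this thread together, a merged version might look roughly like the sketch below. The fetch_links name, the command-line handling, and the error check are additions for illustration, not part of the original posts. It also assumes the page is served from $url without redirects, since $url is used as the base for absolute links.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

# Fetch $url with Basic auth and return the href values of its <a> tags.
# fetch_links is a hypothetical helper name. Passing $url as the second
# argument to HTML::LinkExtor->new asks the parser for absolute URLs.
sub fetch_links {
    my ($url, $user, $pass) = @_;
    my @links;

    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }, $url);

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(GET => $url);
    $req->authorization_basic($user, $pass);

    # The callback goes INSIDE the call to request(), next to $req.
    my $res = $ua->request($req, sub { $p->parse($_[0]) });
    $p->eof;

    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    return @links;
}

# Usage (placeholder values): perl links.pl http://example.com/private/ user pass
if (@ARGV == 3) {
    print "$_\n" for fetch_links(@ARGV);
}
```

When a content callback is passed like this, the body is handed to the callback chunk by chunk and is not kept in the response, so $res->content will be empty afterwards; the links are what you get back.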
