PerlMonks  

Re^2: Crawling Relative Links from Webpages

by listanand (Sexton)
on May 08, 2010 at 01:34 UTC ( [id://838981]=note )


in reply to Re: Crawling Relative Links from Webpages
in thread Crawling Relative Links from Webpages

Thanks for your reply.

Well OK. The point is how do you determine $url? In this case, the $url is "http://dspace.mit.edu" and it is not at all obvious from the webpage (looking at the source) how one would say that this is the server. I have a million different kinds of such webpages from different servers. I need a method that is generic enough to work with all of them.

Any suggestions anyone?

Andy

Re^3: Crawling Relative Links from Webpages
by BrowserUk (Patriarch) on May 08, 2010 at 01:42 UTC

    Something like:

    my $url = $mech->uri;
    # capture scheme plus host, i.e. everything before the first path slash
    my( $server ) = $url =~ m[(^http://[^/]+)/];
    ...
    my $pdfurl = $server . $link;

    Note: There probably is some way of getting the appropriate portion of the url from URI without resorting to regex, but I've never worked out how.
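A minimal sketch of the regex approach above, loosened to also accept https and an explicit port. The helper names server_root and absolutize_root_relative, and the /some/file.pdf link, are made up for illustration and do not come from the thread. It assumes the link is root-relative, that is, it starts with "/".

```perl
use strict;
use warnings;

# Hypothetical helper: return "scheme://host[:port]" of an absolute
# http/https URL, or undef if the string does not look like one.
sub server_root {
    my ($url) = @_;
    my ($server) = $url =~ m{^(https?://[^/?#]+)}i;
    return $server;
}

# Hypothetical helper: prefix a root-relative link (one that starts
# with "/") with the server part of the page it was found on.
sub absolutize_root_relative {
    my ($page_url, $link) = @_;
    my $server = server_root($page_url);
    return undef unless defined $server;
    return $server . $link;
}

# Illustrative values only; the link path is invented.
print absolutize_root_relative(
    'http://dspace.mit.edu/handle/1721.1/53720',
    '/some/file.pdf',
), "\n";
```

This only covers links that begin with "/". Links such as "file.pdf" or "../x.pdf" need real relative-URL resolution, which simple string concatenation does not give you.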


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      uri returns a URI object, so $mech->uri->host or $mech->uri->ihost

        I know. But how do you get the bit the OP needs? Not like this:

        perl -MURI -E"$u=new URI('http://dspace.mit.edu/handle/1721.1/53720'); say $u->host"
        dspace.mit.edu

        Nor any of these:

        c:\test>perl -MURI -E"my $u=new URI('http://dspace.mit.edu/handle/1721.1/53720'); say $u->authority"
        dspace.mit.edu
        c:\test>perl -MURI -E"my $u=new URI('http://dspace.mit.edu/handle/1721.1/53720'); say $u->path"
        /handle/1721.1/53720
        c:\test>perl -MURI -E"my $u=new URI('http://dspace.mit.edu/handle/1721.1/53720'); say $u->fragment"

        c:\test>perl -MURI -E"my $u=new URI('http://dspace.mit.edu/handle/1721.1/53720'); say $u->opaque"
        //dspace.mit.edu/handle/1721.1/53720
        c:\test>perl -MURI -E"my $u=new URI('http://dspace.mit.edu/handle/1721.1/53720'); say $u->canonical"
        http://dspace.mit.edu/handle/1721.1/53720
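None of the single accessors above returns "http://dspace.mit.edu" by itself, but two of them can be combined, and URI can also resolve a relative link against the page URL directly. A hedged sketch, assuming the URI module is installed (it is not core Perl, though WWW::Mechanize depends on it); the helper names and the file.pdf links are invented for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: rebuild "scheme://authority" from a URL.
sub server_root_via_uri {
    my ($url) = @_;
    my $u = URI->new($url);
    return $u->scheme . '://' . $u->authority;
}

# Hypothetical helper: resolve any link (root-relative, path-relative
# or already absolute) against the URL of the page it came from.
sub resolve_link {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

my $page = 'http://dspace.mit.edu/handle/1721.1/53720';

print server_root_via_uri($page), "\n";             # prints http://dspace.mit.edu
print resolve_link('/some/file.pdf', $page), "\n";  # prints http://dspace.mit.edu/some/file.pdf
print resolve_link('file.pdf', $page), "\n";        # prints http://dspace.mit.edu/handle/1721.1/file.pdf
```

With new_abs the server part never has to be extracted at all, which may suit the "generic enough for a million pages" requirement better. If the links come from WWW::Mechanize, its link objects also offer a url_abs method that should give the same kind of result; that is worth checking against the installed version.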
