PerlMonks  

Extracting Page Name

by JimStone (Initiate)
on Apr 26, 2012 at 23:01 UTC (#967484=perlquestion)

JimStone has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and am trying to write a simple program, but I am running into problems grabbing a page name from a string that is a URL. For example, suppose $url holds the URL http://www.myspace/myplace/db/gladtidings.shtml.

I want to extract just the gladtidings.shtml into another string, $pagename, and I would like it to work on any URL. I know it's something simple I'm missing, but I am at a loss right now. Any help would be greatly appreciated.

Replies are listed 'Best First'.
Re: Extracting Page Name
by ww (Archbishop) on Apr 26, 2012 at 23:16 UTC
    The something simple may be that the 'pagename' will follow the last slash.

    The catch: it may be followed by several things -- a colon (Note_1) if it's followed by a port number; a question mark, which has several possible uses; and perhaps others that I'm blanking on just now. But regardless, the text from the last slash, through a period, to the next punctuation should be what you're looking for.

    And to broaden the hint a bit further, the regex documentation and tutorials here will show you precisely the way to obtain what you're looking for.

    Update: Note_1 See the correction (++) by quester immediately below. Aargh.

      ... a colon, if it's followed by a port number...

      Minor nit: The colon and port number come just after the hostname in a URL, not after the page name. For example, consider the port 8080 in

      http://www.example.com:8080/pagename.html

      The question mark following the page name in a URL starts a list of parameters being passed from the browser to the script running in the server. The parameter values can be more or less anything; by convention spaces will have been replaced by plus signs, but otherwise almost anything goes, including colons. For example,

      http://www.example.com/filename.pl?credentials=myuserid:zomg_dont_send_passwords_in_the_clear
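To make the layout described above concrete, here is a minimal sketch that splits a URL into its generic parts. It is not from the thread: the helper name split_url is hypothetical, and the pattern is an assumption borrowed from the generic URI-splitting expression in RFC 3986, Appendix B. It shows that the port travels with the host part, while the query is everything after the first question mark.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a URL into scheme,
# authority (host plus optional port), path, query, and fragment, using
# the generic pattern from RFC 3986, Appendix B.
sub split_url {
    my ($url) = @_;
    my ($scheme, $authority, $path, $query, $fragment) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?};
    return ($scheme, $authority, $path, $query, $fragment);
}

for my $url (
    'http://www.example.com:8080/pagename.html',
    'http://www.example.com/filename.pl?credentials=myuserid:zomg_dont_send_passwords_in_the_clear',
) {
    my ($scheme, $authority, $path, $query) = split_url($url);
    print "authority: $authority\n";    # the port, if any, is part of this
    print "path:      $path\n";
    print "query:     ", (defined $query ? $query : '(none)'), "\n\n";
}
```

For the first URL the authority is www.example.com:8080 and the path is /pagename.html. For the second, the colon appears only inside the query, so it does not affect the path.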

Re: Extracting Page Name
by choroba (Archbishop) on Apr 26, 2012 at 23:20 UTC
    You can use a regular expression. It matches non-slash characters up to the end of the URL.
    my ($pagename) = $url =~ m{([^/]+)$};
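As a quick check, the sketch below applies that same regular expression to the URL from the question and to two edge cases. The helper name last_segment and the two example.com URLs are assumptions added for illustration, not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the reply above.
sub last_segment {
    my ($url) = @_;
    my ($pagename) = $url =~ m{([^/]+)$};
    return $pagename;    # undef when nothing follows the last slash
}

my @urls = (
    'http://www.myspace/myplace/db/gladtidings.shtml',    # URL from the question
    'http://www.example.com/dir/',                        # assumed: trailing slash
    'http://www.example.com',                             # assumed: no path at all
);

for my $url (@urls) {
    my $pagename = last_segment($url);
    print "$url => ", (defined $pagename ? $pagename : '(no match)'), "\n";
}
```

The first URL yields gladtidings.shtml. A URL that ends in a slash does not match at all, and a URL with no path returns the hostname, so callers may want to check for both cases.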

      ... but has a (greedy) failure mode:

      C:\>perl -e "my $url = 'http://www.perlmonks.com/index.pl?node_id=967484'; my ($pagename) = $url =~ m{([^/]+)$}; print $pagename;"
      index.pl?node_id=967484
      C:\>

      Thanks everyone for the quick replies. This regular expression is just what I needed to see what I was doing wrong.
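One possible way around the greedy failure mode shown above is to drop the query string and fragment before taking the last path segment. This is only a sketch under that assumption, not something proposed in the thread, and the helper name page_name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip everything from the
# first '?' or '#' onward, then take the text after the last slash.
sub page_name {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;         # remove query and fragment
    my ($pagename) = $path =~ m{([^/]+)$};    # same regex as in the reply
    return $pagename;                         # undef if the path ends in '/'
}

print page_name('http://www.perlmonks.com/index.pl?node_id=967484'), "\n";
print page_name('http://www.myspace/myplace/db/gladtidings.shtml'), "\n";
```

This prints index.pl and then gladtidings.shtml. It still returns the hostname for a URL that has no path, so it is a starting point rather than a complete URL parser.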
Re: Extracting Page Name
by Marshall (Canon) on Apr 26, 2012 at 23:31 UTC
    There are many modules that can deal with web pages. A very easy one is LWP::Simple. There are many others! Get started, then show us code and where you are having troubles.

    Of course test the URL that you are trying to get by using your normal Web browser. If it can't "get" it, Perl can't either.

Node Type: perlquestion [id://967484]
Approved by ww