PerlMonks  

URLs in plain text

by traveler (Parson)
on Nov 11, 2003 at 16:08 UTC ( [id://306232]=perlquestion )

traveler has asked for the wisdom of the Perl Monks concerning the following question:

I strongly believe this has been done before, but I can't find any modules on CPAN or code here to do it (I probably searched on the wrong terms). I want to search plain text (not HTML) and extract all the URLs. It has been correctly said here many times that a regex is not good enough to parse URLs. True. So is there some module that works with plain text and not HTML?

Replies are listed 'Best First'.
Re: URLs in plain text
by jdtoronto (Prior) on Nov 11, 2003 at 16:34 UTC
    And then there is the module that does it all - I am a regex scaredy cat!

    URI::Find: This module does one thing: Finds URIs and URLs in plain text. It finds them quickly and it finds them all (or what URI::URL considers a URI to be). It only finds URIs which include a scheme (http:// or the like); for something a bit less strict, have a look at URI::Find::Schemeless.
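For anyone who wants to try the module, here is a minimal sketch of how URI::Find is typically driven: you hand it a callback, and it calls that callback once per URI it finds. The sample text and the @found array are made up for illustration and are not from this thread.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical sample text, for illustration only.
my $text = "Docs live at http://www.example.com/docs and "
         . "mirrors at ftp://ftp.example.org/pub for now";

my @found;
my $finder = URI::Find->new(sub {
    my ($uri, $original) = @_;   # URI object, and the text as it appeared
    push @found, "$uri";         # stringify the URI object
    return $original;            # return value replaces the match; keep it as-is
});

# find() takes a reference to the text and returns the number of URIs found.
my $count = $finder->find(\$text);

print "$count URI(s) found\n";
print "  $_\n" for @found;
```

Returning $original from the callback leaves the text untouched; returning something else (an HTML link, say) would rewrite the text in place. URI::Find::Schemeless is documented as having the same interface, so swapping the class name should be enough to try the less strict matcher.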

    jdtoronto

Re: URLs in plain text
by Abigail-II (Bishop) on Nov 11, 2003 at 16:24 UTC
    Regexes can parse many forms of URLs, including the most common ones. Here's a regex for HTTP URIs:
    (?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)
    Alternatively, you may want to use the Regexp::Common module:
    use Regexp::Common;
    print $&, "\n" while $txt =~ /$RE{URI}/g;
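A slightly fuller sketch of the same idea, for readers who would rather collect the matches than print them. It assumes only HTTP URLs are wanted, so it uses $RE{URI}{HTTP} instead of the catch-all $RE{URI}; the sample text and @urls are illustrative, not from the post above.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# Hypothetical sample text, for illustration only.
my $txt = "See http://www.example.com/a/b?x=1 and http://perlmonks.org/ for details";

# The pattern contains only non-capturing groups, so the outer
# parentheses capture each whole match; list context with /g
# returns every match at once.
my @urls = $txt =~ /($RE{URI}{HTTP})/g;

print "$_\n" for @urls;
```

Capturing into a list avoids $&, which imposed a speed penalty on every regex in the program in the Perl versions current at the time. Note that the pattern happily accepts a trailing "." or "," as part of the path, so URLs at the end of a sentence may need trimming afterwards.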

    Abigail

      Oh, my bleeding eyes!! ;) Perhaps an entry for some Obfuscation - lots of nested lookaheads and that's where I got lost.

      $RE{'URI'} for me!

        There shouldn't be any lookaheads in that regexp.

        Abigail
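The (?:...) constructs that fill the regexp above are non-capturing groups, which are easy to mistake for lookaheads, written (?=...). A small sketch of the difference, using a made-up string:

```perl
use strict;
use warnings;

my $text = "foobar";

# (?:...) only groups; the text it matches is consumed as usual,
# so it ends up inside the outer capture.
my ($grouped) = $text =~ /(foo(?:bar))/;

# (?=...) is a lookahead: it asserts that "bar" follows but
# consumes nothing, so the outer capture stops after "foo".
my ($peeked) = $text =~ /(foo(?=bar))/;

print "non-capturing group: $grouped\n";
print "lookahead:           $peeked\n";
```

Here $grouped holds "foobar" and $peeked holds "foo".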

Re: URLs in plain text
by batkins (Chaplain) on Nov 11, 2003 at 16:39 UTC
    URI::Find works for me.
    Are you sure it was a book? Are you sure it wasn't.....nothing?
Re: URLs in plain text
by gjb (Vicar) on Nov 11, 2003 at 16:26 UTC

    Have a look at Regexp::Common, there are a number of expressions to extract URLs.

    Hope this helps, -gjb-

Node Type: perlquestion [id://306232]
Approved by gjb