Hi valerydolce,
Designing your own regex is certainly good practice. For a production system, though, I'd recommend existing modules, such as Regexp::Common to find URIs and URI to parse them. For example:
use warnings;
use strict;
my $str = <<'END_STR';
I am an example http://www.perlmonks.org/?parent=1176663;node_id=3333 text
that contains <https://perlmonks.pair.com/?node_id=1176651> two URIs
END_STR
use Regexp::Common qw/URI/;
use URI;
while ($str=~/$RE{URI}{-keep}/g) {
    my $uri = URI->new($1);
    print "$uri\n";
    print " Scheme: ", $uri->scheme, "\n";
    print " Host: ", $uri->host, "\n";
    print " Path: ", $uri->path, "\n";
    print " Query: ", $uri->query, "\n";
}
See the URI documentation for lots more ways to access the different parts of the URI. Unfortunately, I noticed that Regexp::Common apparently doesn't match the #fragment part of the URI, so here's an attempt at an alternative solution, using a regex based on the characters that RFC 3986 allows in URIs.
# NOTE this is based on a quick skim of RFC 3986 and may not be complete!
my $url_re = qr{
# https://tools.ietf.org/html/rfc3986#section-2
# URI = scheme ":" hier-part ...; hier-part = "//" ...
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
[A-Za-z][A-Za-z0-9+\-.]* ://
# gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
# sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
# / "*" / "+" / "," / ";" / "="
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
( [:/?#\[\]@!\$&'()*+,;=A-Za-z0-9\-._~]
# pct-encoded = "%" HEXDIG HEXDIG
| %[0-9A-Fa-f]{2} )*
}x;
while ($str=~/($url_re)/g) {
    my $uri = URI->new($1);
    print "$uri\n";
}
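Once a match has been turned into a URI object, the other accessors of the URI module can pull out the remaining pieces, including the fragment. Here's a minimal sketch; the sample URI is made up for illustration and is not taken from the text above:

```perl
use warnings;
use strict;
use URI;

# Hypothetical sample URI, for illustration only
my $uri = URI->new('https://www.example.com/docs/page?node_id=42&parent=7#top');

# The part after "#"
print "Fragment: ", $uri->fragment, "\n";

# Path split into segments (the leading empty segment is filtered out)
print "Segments: ", join(", ", grep { length } $uri->path_segments), "\n";

# Query string decoded into key/value pairs
my %params = $uri->query_form;
print "  $_ = $params{$_}\n" for sort keys %params;
```

The same calls should work on the $uri objects created in the loops above, as long as the matched string really is a well-formed URI.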
Hope this helps,
-- Hauke D