Regex for extracting a domain name from a string.

jnbek has asked for the wisdom of the Perl Monks concerning the following question:

I have what's probably a simple question for my fellow monks, but regular expressions are one of my weaknesses; I am unable to wrap my head around anything beyond the basics.

my $var =~ m/\w/i;

Hence my problem. I need a rather complicated regex: one that can extract a domain name from a string that could be anything from a full URL:

http://www.perlmonks.org/?node=Seekers%20of%20Perl%20Wisdom

or an email address, or even a bare string:

www.sub.sub2.domain.com
domain.com
ftp.domain.co.uk
adsl-44-33-22-11.dsl.bcvloh.sbcglobal.net

and I actually would like it to return two results, provided the entered string was more than just domain.com. Using the last line as my example I would need the 2 results to be:

bcvloh.sbcglobal.net
sbcglobal.net

also, I'd need to make sure that if an international domain name or URL were given, it checked for it and returned:

some.domain.com.au
domain.com.au

Again, that is provided the string was more than just domain.com.br. If only the bare minimum was entered, a single result is enough:

domain.com
domain.co.uk
domain.fm
domain.name
..etc, etc..

Now, I've searched and read a couple of nodes here that are very similar to this question, but they aren't quite enough for me to work with. One splits up a domain name such as domain.com to extract domain, and the other focuses only on http:// URLs. I've also searched Google, and the results I've found again don't give me enough to work with, as I am rather dense when it comes to regex.

Many Thanks Fellow Monks,

jnbek

=== Update ===

Looks like actually I have been looking at this from the wrong angle. I have managed to make myself feel like the silly n00b that I am. I only need a regex to strip off extra characters from the front and back, basically between the /'s. Strip off http://|ftp:// etc, then strip the right end / or ? or # then use the pop() function a couple times with a join to get the domain name. So, be it sub1.sub2.sub3.foo.bar.www.domain.com or domain.com I get domain.com to work with. I've only got initial test code with the pop() usage:

 my $d = "spam.yomama.www.zoelife4u.org";
 my @domain = split(/\./, $d);

 my $tld = pop(@domain);        # org
 my $baredomain = pop(@domain); # zoelife4u

 my @result = ( $baredomain, $tld );
 my $maindomain = join(".", @result);

 print "End: $maindomain\n";

And I think I've found a useful regex to work with here. Based on this, anyone have any critique?
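
A minimal sketch of the steps described in the update, assuming the input is a URL, an email address, or a bare host name. The helper name `last_two_labels` and the address `user@mail.example.com` are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL, email address, or bare host name
# to its last two dot-separated labels.
sub last_two_labels {
    my ($input) = @_;
    my $host = $input;

    $host =~ s{^[a-z][a-z0-9+.-]*://}{}i;   # strip a leading scheme such as http:// or ftp://
    $host =~ s{[/?#].*}{}s;                 # strip path, query, and fragment
    $host =~ s{^.*\@}{};                    # strip user info or the local part of an email address
    $host =~ s{:\d+\z}{};                   # strip a trailing port number
    $host =~ s{\.\z}{};                     # strip a trailing dot

    my @labels = split /\./, lc $host;
    return unless @labels >= 2;             # not enough labels to form a domain

    my $tld  = pop @labels;
    my $name = pop @labels;
    return join '.', $name, $tld;
}

print last_two_labels($_), "\n"
    for 'http://www.perlmonks.org/?node=Seekers%20of%20Perl%20Wisdom',
        'user@mail.example.com',
        'spam.yomama.www.zoelife4u.org';
```

Note that keeping exactly two labels turns `ftp.domain.co.uk` into `co.uk`, so this approach on its own does not cover the country-code cases from the original question.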


Replies are listed 'Best First'.
Re: Regex for extracting a domain name from a string. by Cody Pendant (Prior) on Feb 28, 2008 at 03:25 UTC
Short answer: this problem is too difficult for "a regex" to handle. You need a module.

Nobody says perl looks like line-noise any more kids today don't know what line-noise IS ...
Re^2: Regex for extracting a domain name from a string. by sundialsvc4 (Abbot) on Feb 28, 2008 at 03:45 UTC
Indeed. My take on the matter is ... what this person wants to get is: “this-or-that part of ‘a domain name.’” And it so happens that the “obvious way” to get there is... “well, (obviously) I have to (figure out how to) write a regular expression” before I'll be able to “get that.” And I'm reaching over and hitting the Pause button and saying... “oh, really?” I'll bet that there's a very good CPAN module already out there which will give you the answer that you want. I'll wager that in fact you don't have to tackle regular expressions in order to get the result you want, because somebody out there has already done it and has done it very well. Dictum Ne Agas: “Do Not Do A Thing Already Done.”
Re^3: Regex for extracting a domain name from a string. by Cody Pendant (Prior) on Feb 28, 2008 at 03:51 UTC
For the "this or that part", they need a different module which can extract the parts like host, path, query string, port etc.
Re^2: Regex for extracting a domain name from a string. by jnbek (Scribe) on Feb 28, 2008 at 05:37 UTC
Thank you for pointing me to the modules, I think those will work great. Between URI::Find and URI::Find::Schemeless I 'should' be able to get done what I need. I'll let you know my results. Thanks, jnbek
Re: Regex for extracting a domain name from a string. by sundialsvc4 (Abbot) on Feb 28, 2008 at 03:22 UTC
Is there a CPAN module that will give you the result that you want, without you having to create a regular expression at all? Focus on the result that you want to obtain...
Re: Regex for extracting a domain name from a string. by kyle (Abbot) on Feb 28, 2008 at 04:10 UTC
I don't have a ready-made solution for you. For this kind of problem, I usually can recommend some part of Regexp::Common, but it doesn't seem to have a module for this.

It sounds as if you want `sub.company.domx.domy` as well as just `company.domx.domy` where `domx` is optional, depending on what `domy` is. I'm not sure what the allowable alphabet for domain parts is, but for now I'll say

    $dom = qr{
        [a-z]        # starts with a letter
        [a-z0-9-]*   # zero or more letters, numbers, hyphens
    }xi;

From there, get a list of Internet top-level domains. Look through and see which of those you want an "extra" domain for. For `au`, for example, you want

    $au = qr{
        $dom \.   # company
        $dom \.   # subdomain
        au        # ccTLD
        \. ?      # optional trailing dot
        \b        # word break
    }xi;

But for the mighty `.com`, it's just

    $com = qr{
        $dom \.   # company
        com       # gTLD
        \. ?      # optional trailing dot
        \b        # word break
    }xi;

The word break at the end keeps us from matching `i.am.a.silly.com.administrator.at.example.com` as `silly.com` instead of `example.com`. This might not be the right thing for your input, though. It might be better to say `(?! $dom)` there instead (negative look ahead to confirm there's no more `$dom` stuff). The optional trailing dot allows "`example.com.`" as well as "`example.com`" (both are "legal").

Given these, you can make

    my $domain = qr{
        (?:          # start of group
            $dom \.  # company subdomain
        ) ?          # end of group, make it optional
        (?:          # start of group
            $com     # .com pattern
            |        # or
            $au      # .au pattern
        )            # end of group
    }xi;

Having written all this and then done some testing, I find that this is harder than I thought (I should have known!). I'll stop now and provide what I got so far along with the tests that show it doesn't work. Expanding the testing to more interesting cases once you have a solution should be easy.

One problem revealed by the tests is that the pattern will be happy to match merely `example.com` when faced with `example.com.au`. Also, it will match domains with illegal characters in them, basically by pretending that the name ends at the illegal character. For example `illegal_underscore.com` matches as `underscore.com`. Depending on your application, that may be acceptable, but I don't like it much.
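
One way around the `example.com` versus `example.com.au` problem is to split the name on dots and consult a table of known two-label suffixes, rather than building one pattern per TLD. The sketch below swaps in that lookup technique; it is not kyle's code. The helper name `domain_levels` and the suffix table are assumptions, and the table is deliberately incomplete (a maintained source such as the Public Suffix List would be needed for real coverage).

```perl
use strict;
use warnings;

# Illustrative, deliberately incomplete table of two-label suffixes.
# These entries are assumptions for the sketch, not an authoritative list.
my %two_label_suffix = map { $_ => 1 } qw(co.uk org.uk com.au net.au com.br);

# Hypothetical helper: return the registered domain and, when one more
# label exists, the name one level above it (longest name first).
sub domain_levels {
    my ($host) = @_;
    $host = lc $host;
    $host =~ s/\.\z//;                      # allow a trailing dot, as in "example.com."
    my @labels = split /\./, $host;
    return unless @labels >= 2;

    # Keep three labels when the last two form a known suffix, otherwise two.
    my $keep = $two_label_suffix{"$labels[-2].$labels[-1]"} ? 3 : 2;
    return unless @labels >= $keep;

    my @results;
    push @results, join '.', @labels[ -($keep + 1) .. -1 ] if @labels > $keep;
    push @results, join '.', @labels[ -$keep .. -1 ];
    return @results;
}

print join(', ', domain_levels($_)), "\n"
    for qw(some.domain.com.au domain.co.uk adsl-44-33-22-11.dsl.bcvloh.sbcglobal.net);
```

For the three sample inputs this yields the pairs asked for in the original question: two results when there is an extra label, one when only the bare domain was given. It does not validate the characters in each label.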
Re: Regex for extracting a domain name from a string. by igelkott (Priest) on Feb 28, 2008 at 04:09 UTC
Get your hostnames with a module like URI::Heuristic or a homemade (buggy) regex like `($host) = $uri =~ /([\w]+\.[\w]+\.[\w\.]+)/;` Then count the components with `$dots = ($host =~ y/\.//);` and make decisions on how much to print based on the number of dots you find. PS: Yes, I know that regex was weak but it's just a simple example to get started. A proper module is of course suggested to get any sort of reliability for a problem this complicated.
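
A minimal sketch of the dot-counting idea, assuming that "how much to print" means the last three and the last two labels when the host has at least two dots. The helper name `report_by_dots` is hypothetical, and the loose host pattern is the one quoted in the reply, with its known weaknesses (for example, `\w` does not match hyphens).

```perl
use strict;
use warnings;

# Hypothetical helper: pick what to report from the number of dots in a host.
sub report_by_dots {
    my ($host) = @_;
    $host =~ s/\.\z//;                   # ignore a trailing dot
    my $dots   = ($host =~ tr/.//);      # tr/// with an empty replacement only counts
    my @labels = split /\./, $host;

    return ($host) if $dots <= 1;        # already minimal, e.g. domain.com
    return ( join('.', @labels[-3 .. -1]), join('.', @labels[-2 .. -1]) );
}

my $uri = 'http://www.perlmonks.org/?node=Seekers%20of%20Perl%20Wisdom';
my ($host) = $uri =~ /(\w+\.\w+\.[\w.]+)/;   # the loose pattern from the reply
print join(', ', report_by_dots($host)), "\n" if defined $host;
```

As with any fixed label count, this treats `domain.co.uk` as having a registrable part of `co.uk`, so the count would still need adjusting per TLD.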
Re: Regex for extracting a domain name from a string. by Ellipsis (Novice) on Feb 28, 2008 at 06:00 UTC
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $url = shift @ARGV || 'http://www.ietf.org/rfc/rfc2396.txt';

    # [Page 28] "Parsing a URI Reference with a Regular Expression"
    if (my @parts = ($url =~ m{
            ^(([^:/?#]+):)?   #1-2 scheme
            (//([^/?#]*))?    #3-4 authority
            ([^?#]*)          #5 path
            (\?([^#]*))?      #6-7 query
            (\#(.*))?         #8-9 fragment
        }x)) {
        print Dumper(\@parts);
    }
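
To get from that pattern to a host name, capture group 4 (the authority) still has to be stripped of any user info and port. The following is a sketch under that reading of RFC 2396; the helper name `host_from_url` is hypothetical, and only the scheme and authority parts of the pattern are used.

```perl
use strict;
use warnings;

# Hypothetical helper built on the scheme and authority parts of the
# RFC 2396 Appendix B pattern. Only the authority is captured here.
sub host_from_url {
    my ($url) = @_;
    my ($authority) = ( $url =~ m{^(?:[^:/?#]+:)?(?://([^/?#]*))?} );
    return unless defined $authority && length $authority;

    $authority =~ s/^.*\@//;     # drop user info such as "user:password@"
    $authority =~ s/:\d*\z//;    # drop a port such as ":8080"
    return lc $authority;
}

my $host = host_from_url('http://www.ietf.org/rfc/rfc2396.txt');
print defined $host ? "$host\n" : "no authority found\n";
```

A bare string such as `www.sub.sub2.domain.com` has no `//`, so the RFC pattern places all of it in the path (group 5) and this helper returns nothing for it. Bare host names would need separate handling.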

