Regex for extracting a domain name from a string.

by jnbek (Scribe)
on Feb 28, 2008 at 02:33 UTC ( [id://670802]=perlquestion )

jnbek has asked for the wisdom of the Perl Monks concerning the following question:

I have what's probably a simple question for my fellow monks, but regular expressions are one of my weaknesses; I am just unable to wrap my head around anything beyond the basics.

$var =~ m/\w/i;

Hence my problem. I need a rather complicated regex: I need to be able to extract a domain name from a string, which could be anything from a full URL:

http://www.perlmonks.org/?node=Seekers%20of%20Perl%20Wisdom

or an email address, or even a bare string:

www.sub.sub2.domain.com
domain.com
ftp.domain.co.uk
adsl-44-33-22-11.dsl.bcvloh.sbcglobal.net

I would actually like it to return two results, provided the entered string was more than just domain.com. Using the last line as my example, the two results would need to be:

bcvloh.sbcglobal.net
sbcglobal.net

Also, if an international domain name or URL were given, it would need to recognize that and return:

some.domain.com.au
domain.com.au

Again, that is provided the string was more than just domain.com.br. If only the bare minimum was entered, it should return just that:

domain.com
domain.co.uk
domain.fm
domain.name
..etc, etc..

I've searched and read a couple of nodes here that are very similar to this question, but they aren't quite enough for me to work with. One splits up a domain name like domain.com to extract domain, and the other focuses only on http:// URLs. I've also searched Google, and the results again don't give me quite enough to work with, as I am rather dense when it comes to regex.

Many Thanks Fellow Monks,

jnbek

=== Update ===

It looks like I have been approaching this from the wrong angle, and I have managed to make myself feel like the silly n00b that I am. I only need a regex to strip the extra characters from the front and back, basically keeping what sits between the /'s: strip off http://, ftp://, etc., then strip the right end from the first /, ?, or # onward, then use pop() a couple of times with a join to get the domain name. So whether the input is sub1.sub2.sub3.foo.bar.www.domain.com or domain.com, I get domain.com to work with. So far I only have initial test code for the pop() usage:

my $d = "spam.yomama.www.zoelife4u.org";
my @domain = split(/\./, $d);
my $tld = pop(@domain);          # org
my $baredomain = pop(@domain);   # zoelife4u
my @result = ( $baredomain, $tld );
my $maindomain = join(".", @result);
print "End: $maindomain\n";

And I think I've found a useful regex to work with here. Based on this, anyone have any critique?
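
The strip-then-pop approach from the update could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a complete solution: the sub name last_two_labels is hypothetical, and keeping only the last two labels is right for suffixes like .com or .org but returns co.uk for ftp.domain.co.uk.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL, email address, or bare host name
# to its last two dot-separated labels.
sub last_two_labels {
    my ($input) = @_;
    my $host = $input;
    $host =~ s{^[a-z][a-z0-9+.-]*://}{}i;   # strip a leading scheme such as http:// or ftp://
    $host =~ s{[/?#].*$}{}s;                # strip path, query, and fragment
    $host =~ s{^.*\@}{};                    # strip user@ (also covers email addresses)
    $host =~ s{:\d+$}{};                    # strip a trailing :port
    $host =~ s{\.$}{};                      # strip a trailing dot
    my @labels = split /\./, lc $host;
    return unless @labels >= 2;             # not enough labels to form a domain
    return join '.', @labels[-2, -1];
}

for my $input (
    'http://www.perlmonks.org/?node=Seekers%20of%20Perl%20Wisdom',
    'spam.yomama.www.zoelife4u.org',
    'ftp.domain.co.uk',
) {
    my $domain = last_two_labels($input);
    print "$input => ", (defined $domain ? $domain : '(none)'), "\n";
}
```

The last input shows the limitation: two labels are not enough under country-code suffixes, which is what the replies below are getting at.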

Replies are listed 'Best First'.
Re: Regex for extracting a domain name from a string.
by Cody Pendant (Prior) on Feb 28, 2008 at 03:25 UTC
    Short answer, this problem is too difficult for "a regex" to handle. You need a module.



    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...

      Indeed. My take on the matter is ... what this person wants to get is:   “this-or-that part of ‘a domain name.’” And it so happens that the “obvious way” to get there is... “well, (obviously) I have to (figure out how to) write a regular expression” before I'll be able to “get that.”

      And I'm reaching over and hitting the Pause button and saying... “oh, really?” I'll bet that there's a very good CPAN-module already out there which will give you the answer that you want. I'll wager that in fact you don't have to tackle regular-expressions in order to get the result you want, because somebody out there has already done it and has done it very well.

      Dictum Ne Agas — “Do Not Do A Thing Already Done.”

        For the "this or that part", they need a different module which can extract the parts like host, path, query string, port etc. etc.


        Nobody says perl looks like line-noise any more
        kids today don't know what line-noise IS ...

      Thank you for pointing me to the modules, I think those will work great. Between URI::Find and URI::Find::Schemeless I 'should' be able to get done what I need. I'll let you know my results.

      Thanks,

      jnbek
Re: Regex for extracting a domain name from a string.
by sundialsvc4 (Abbot) on Feb 28, 2008 at 03:22 UTC

    Is there a CPAN module that will give you the result that you want, without you having to create a regular-expression at all?

    Focus on the result that you want to obtain...

Re: Regex for extracting a domain name from a string.
by kyle (Abbot) on Feb 28, 2008 at 04:10 UTC

    I don't have a ready-made solution for you. For this kind of problem, I usually can recommend some part of Regexp::Common, but it doesn't seem to have a module for this.

    It sounds as if you want sub.company.domx.domy as well as just company.domx.domy where domx is optional, depending on what domy is.

    I'm not sure what the allowable alphabet for domain parts is, but for now I'll say

    $dom = qr{
        [a-z]         # starts with a letter
        [a-z0-9-]*    # zero or more letters, numbers, hyphens
    }xi;

    From there, get a list of Internet top-level domains. Look through and see which of those you want an "extra" domain for. For au, for example, you want

    $au = qr{
        $dom \.    # company
        $dom \.    # subdomain
        au         # ccTLD
        \. ?       # optional trailing dot
        \b         # word break
    }xi;

    But for the mighty .com, it's just

    $com = qr{
        $dom \.    # company
        com        # gTLD
        \. ?       # optional trailing dot
        \b         # word break
    }xi;

    The word break at the end keeps us from matching i.am.a.silly.com.administrator.at.example.com as silly.com instead of example.com. This might not be the right thing for your input, though. It might be better to say (?! $dom) there instead (negative look ahead to confirm there's no more $dom stuff).

    The optional trailing dot allows "example.com." as well as "example.com" (both are "legal").

    Given these, you can make

    my $domain = qr{
        (?:            # start of group
            $dom \.    # company subdomain
        ) ?            # end of group, make it optional
        (?:            # start of group
            $com       # .com pattern
            |          # or
            $au        # .au pattern
        )              # end of group
    }xi;

    Having written all this and then done some testing, I find that this is harder than I thought (I should have known!). I'll stop now and provide what I got so far along with the tests that show it doesn't work. Expanding the testing to more interesting cases once you have a solution should be easy.

    One problem revealed by the tests is that the pattern will be happy to match merely example.com when faced with example.com.au. Also, it will match domains with illegal characters in them, basically by pretending that the name ends at the illegal character. For example illegal_underscore.com matches as underscore.com. Depending on your application, that may be acceptable, but I don't like it much.
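
The two failure modes described in this reply can be reproduced with a short script. It restates the patterns above in compact form; the helper name first_domain is an assumption made for illustration and is not part of the original reply.

```perl
use strict;
use warnings;

my $dom = qr{
    [a-z]         # starts with a letter
    [a-z0-9-]*    # zero or more letters, numbers, hyphens
}xi;

my $au     = qr{ $dom \. $dom \. au  \.? \b }xi;
my $com    = qr{ $dom \.         com \.? \b }xi;
my $domain = qr{ (?: $dom \. )? (?: $com | $au ) }xi;

# Hypothetical helper: return the first thing $domain matches, or undef.
sub first_domain {
    my ($text) = @_;
    return $text =~ /($domain)/ ? $1 : undef;
}

for my $text ('www.example.com', 'example.com.au', 'illegal_underscore.com') {
    my $match = first_domain($text);
    print "$text => ", (defined $match ? $match : '(no match)'), "\n";
}
```

For example.com.au the match stops short of the .au part, and for illegal_underscore.com the match simply starts after the underscore, as the reply describes.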

Re: Regex for extracting a domain name from a string.
by igelkott (Priest) on Feb 28, 2008 at 04:09 UTC
    Get your hostnames with a module like URI::Heuristic or a homemade (buggy) regex like
    ($host) = $uri =~ /([\w]+\.[\w]+\.[\w\.]+)/;

    Then count the components with $dots = ($host =~ y/\.//); and make decisions on how much to print based on the number of dots you find.

    PS: Yes, I know that regex was weak but it's just a simple example to get started. A proper module is of course suggested to have any sort of reliability for this complicated of a problem.
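
Counting the dots and deciding how much to keep could look roughly like the sketch below. The sub name candidate_domains and the three-entry suffix table are assumptions made for illustration; a real program would need a maintained suffix list rather than a hand-written one.

```perl
use strict;
use warnings;

# Illustrative only: a tiny hand-written table of two-label suffixes.
my %two_label_suffix = map { $_ => 1 } qw(co.uk com.au com.br);

# Hypothetical helper: return one or two candidate domains for a host name.
sub candidate_domains {
    my ($host) = @_;
    my $dots   = ($host =~ tr/.//);           # tr with an empty replacement only counts
    my @labels = split /\./, lc $host;

    # Keep three labels under a listed two-label suffix, otherwise two.
    my $keep = (@labels >= 2 && $two_label_suffix{ join '.', @labels[-2, -1] }) ? 3 : 2;

    # Nothing to trim, e.g. domain.com or domain.co.uk.
    return (join '.', @labels) if $dots < $keep;

    # One level above the base domain, then the base domain itself.
    return (
        join('.', @labels[ -($keep + 1) .. -1 ]),
        join('.', @labels[ -$keep .. -1 ]),
    );
}

for my $host (qw(adsl-44-33-22-11.dsl.bcvloh.sbcglobal.net some.domain.com.au domain.com)) {
    print "$host => ", join(' ', candidate_domains($host)), "\n";
}
```

The host name is assumed to be already extracted and free of a trailing dot.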

Re: Regex for extracting a domain name from a string.
by Ellipsis (Novice) on Feb 28, 2008 at 06:00 UTC
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $url = shift @ARGV || 'http://www.ietf.org/rfc/rfc2396.txt';

    # [Page 28] "Parsing a URI Reference with a Regular Expression"
    if (my @parts = ($url =~ m{
            ^(([^:/?#]+):)?    #1-2 scheme
            (//([^/?#]*))?     #3-4 authority
            ([^?#]*)           #5 path
            (\?([^#]*))?       #6-7 query
            (\#(.*))?          #8-9 fragment
        }x)) {
        print Dumper(\@parts);
    }
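
Getting from the captured authority to a host name takes one more step. A minimal sketch, assuming the same RFC 2396 pattern; the sub name host_from_uri is hypothetical. Note that a bare string such as www.example.com has no // part, so it lands in the path capture and this sketch returns nothing for it.

```perl
use strict;
use warnings;

my $uri_re = qr{
    ^(([^:/?#]+):)?    # 1-2 scheme
    (//([^/?#]*))?     # 3-4 authority
    ([^?#]*)           # 5   path
    (\?([^#]*))?       # 6-7 query
    (\#(.*))?          # 8-9 fragment
}x;

# Hypothetical helper: return the lowercased host of a URI, or undef
# when the URI has no authority part.
sub host_from_uri {
    my ($uri) = @_;
    my @parts = $uri =~ $uri_re or return;
    my $authority = $parts[3];              # capture group 4
    return unless defined $authority && length $authority;
    $authority =~ s/^.*\@//;                # drop userinfo such as user:pass@
    $authority =~ s/:\d*$//;                # drop a trailing :port
    return lc $authority;
}

for my $uri ('http://www.ietf.org/rfc/rfc2396.txt', 'ftp://user@ftp.domain.co.uk:21/pub') {
    my $host = host_from_uri($uri);
    print "$uri => ", (defined $host ? $host : '(no authority)'), "\n";
}
```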
