Finding short DNS names from long text

by dwhite20899 (Friar)
on Jul 12, 2009 at 19:44 UTC

dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

I've been looking on CPAN, googling, etc. and I can't find anything on this topic, so I'm hoping someone here can share a flash of brilliance.

I have a list of about 100 companies, and I need to find the short registered names from DNS for each of them. For example, I have "Adobe Systems, Inc." - since they own "adobe.com" the result is "adobe". "American Power Conversion" should result in "apc".
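To make the two examples concrete, here is a minimal sketch of how candidate short names could be generated from the long text alone, before any DNS or search lookup. The helper name `candidates` and the list of legal suffixes are assumptions made for illustration; nothing here confirms that a candidate is actually registered.

```perl
use strict;
use warnings;

# Assumed list of legal suffixes to ignore; extend as needed.
my %SUFFIX = map { $_ => 1 } qw(inc corp corporation ltd llc co company);

# Hypothetical helper: propose short-name candidates for a company name.
sub candidates {
    my ($name) = @_;
    my @words = grep { length } split /[^a-z0-9]+/, lc $name;
    return unless @words;

    # Drop trailing legal suffixes such as "inc" or "ltd".
    pop @words while @words > 1 && $SUFFIX{ $words[-1] };

    my %seen;
    return grep { !$seen{$_}++ } (
        $words[0],                                   # first word
        join('', @words),                            # all words joined
        join('', map { substr($_, 0, 1) } @words),   # acronym
    );
}

print join(' ', candidates($_)), "\n"
    for 'Adobe Systems, Inc.', 'American Power Conversion';
```

This yields "adobe" as the first candidate for the first name and "apc" as the acronym candidate for the second. Each candidate would still need to be checked against real data.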

This needs to be done just once, so I'm tempted to just put a person on it, but if there is a way to automate it, I'd rather do that.

Any insights are appreciated! Doug

Update: Got something working.

It needs some work, but this gets me dang close.

Given a file with the NVD known vendor strings and a file with the NSRL manufacturer text strings, this fetches 6 URLs from Yahoo for each manufacturer, strips them down to bare host names, and prints the results tab separated: suggested mappings in square brackets, known NVD vendor strings in curly braces, and the remaining hosts as context.

Output looks like this:

BeLight Software                        [belight] [belightsoft] en.wikipedia
BeLight Software, Ltd.                  [belightsoft] go.cadwire macs.about
Bea Systems, Inc.                       [beasys] {bea} {oracle}
Beermat Software Ltd.                   [beermatsoftware] encarta.msn
Belkin Corp                             [belkin] bizjournals cnet updates.zdnet
Bell Atlantic Internet Solutions Inc.   [bellatlantic] prnewswire verizon yale.edu
Berkley Systems                         berkeley best.me.berkeley.edu bt-systems bvsystems en.wikipedia gis.co.berkeley.sc.us
Bethesda SoftWorks                      bethsoft elderscrolls support.bethsoft
Big Fish Games                          [bigfishgames] atlantis.bigfishgames bigfish.es
Big Fish Games, Inc.                    [bigfishgames] bigfish.es bigfishgames.es otg.bigfishgames
BioWare                                 [bioware] blog.bioware nwn.bioware store.bioware

The code:
#!/opt/local/bin/perl -w
use strict;
use Yahoo::Search;
use vars qw( %nvdVendor $textName %foundName @Results );

# Load the known NVD vendor strings, one per line.
open(VIN, "NVD-vendors.txt") or die "$0 : cant open support file NVD-vendors.txt\n";
while (<VIN>) {
    chomp;
    $nvdVendor{$_} = 1;
}
close(VIN);

# For each NSRL manufacturer name, search Yahoo and classify the hosts found.
open(NIN, "NSRL-manufacturers.txt") or die "$0 : cant open support file NSRL-manufacturers.txt\n";
while (<NIN>) {
    chomp;
    $textName = $_;
    (defined $textName) or next;
    @Results = Yahoo::Search->Results(Doc => "$textName", AppId => "YahooDemo", Count => 6, Mode => 'all');
    warn $@ if $@; # report any errors
    for my $Result (@Results) {
        addFullName($Result->Url);
    }
    print "$textName\t\t";
    my %guesses;
    for my $k (keys %foundName) {
        if (defined $nvdVendor{$k}) {
            $guesses{"\{$k\}"} = 1;    # known NVD vendor string
        } else {
            if (closeEnuff($textName, $k)) {
                $guesses{"[$k]"} = 1;  # suggested mapping
            } else {
                $guesses{$k} = 1;      # other context
            }
        }
        delete $foundName{$k};
    }
    print join("\t", (sort keys %guesses));
    print "\n";
    sleep(3);
} # NIN
exit;

# Reduce a URL to its server name and count it in %foundName.
sub addFullName {
    my $url = shift;
    $url = lc($url);
    ($url =~ /^http:/ ) or return(0);
    $url =~ s/http:\/\/// ;
    (my $n, my $p) = split(/\//,$url,2); # get the server name
    $n =~ s/^www\.// ; # strip common pre/postfixes
    $n =~ s/\.com$// ;
    $n =~ s/\.net$// ;
    $n =~ s/\.org$// ;
    $n =~ s/\.co\.uk$// ;
    $foundName{$n} += 1;
    return(1);
}

# Is the candidate $y close enough to the text name $t?
# \Q...\E keeps dots and other metacharacters in $y literal.
sub closeEnuff {
    my $t = shift;
    my $y = shift;
    if ($t =~ / \Q$y\E /i ) { return(1); } # does the candidate match a word in the text name?
    $t =~ s/ //g ;
    if ($t =~ /\Q$y\E/i ) { return(1); } # does the candidate match the text name with spaces removed?
    # should do a check after removing special chars
    return(0);
}
__END__
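The last comment in closeEnuff mentions a check after removing special characters. A minimal sketch of that idea, assuming it is acceptable to compare only lowercase letters and digits; the names `normalize` and `close_enough` are hypothetical and not part of the script above:

```perl
use strict;
use warnings;

# Lowercase a string and drop everything that is not a letter or digit.
sub normalize {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z0-9]//g;
    return $s;
}

# True if the normalized candidate occurs inside the normalized text name.
sub close_enough {
    my ($text_name, $candidate) = @_;
    my $c = normalize($candidate);
    return 0 unless length $c;
    return index(normalize($text_name), $c) >= 0 ? 1 : 0;
}

print close_enough('Big Fish Games, Inc.', 'bigfishgames') ? "match\n" : "no match\n";
print close_enough('Bethesda SoftWorks',   'bethsoft')     ? "match\n" : "no match\n";
```

The first call reports a match and the second does not, which agrees with the sample output, where bethsoft appears without brackets. Using index instead of a regex avoids any metacharacter problems in the candidate.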

Replies are listed 'Best First'.
Re: Finding short DNS names from long text
by merlyn (Sage) on Jul 12, 2009 at 20:03 UTC
    I think you're having trouble because your question is incomplete. You presume that there's "one" "domain" for each company. I can assure you that there isn't. And the question is also context-sensitive, since it would depend on what country you are in. Also, "adobe" is useless without the ".com", so you'll need to keep the entire name.

    The more you can narrow down your question to the point where it would have the single answers you gave, the more likely you are to come up with a solution that fulfills an answerable question. Start there.

    For example, you could use the Yahoo::Search module and identify the top web hit for each of the companies you list. That would more than likely be correct, but if some company's own website is less trafficked than some other site that talks about them, you might be in for a surprise. But that's where your "human" could come in... present the top five hits on a clickable web interface, and let your human assist in selecting the name.

    -- Randal L. Schwartz, Perl hacker

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

      You're correct, I left that (multiple results) out.

      To be extremely specific, I want to mimic the vendor name part of the MITRE Common Platform Enumeration found on page 10 of http://cpe.mitre.org/files/cpe-specification_2.2.pdf

      They deal with multiple results by using the shortest string. They also deal with saving ".com" or ".org" in certain cases.
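The shortest-string rule described here can be sketched in a few lines. The function name `shortest` and the alphabetical tie-break are assumptions added so the result is deterministic; they are not taken from the CPE specification.

```perl
use strict;
use warnings;

# Return the shortest candidate; equal lengths are ordered alphabetically
# (an assumed tie-break). Returns undef for an empty list.
sub shortest {
    my @sorted = sort { length($a) <=> length($b) or $a cmp $b } @_;
    return $sorted[0];
}

print shortest(qw(belightsoft belight)), "\n";   # prints "belight"
```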

      It's certainly got its shortcomings, but I want to sync up with http://nvd.nist.gov .

      Yahoo::Search looks good!

        "To be extremely specific, I want to mimic the vendor name part of the MITRE Common Platform Enumeration found on page 10 of http://cpe.mitre.org/files/cpe-specification_2.2.pdf "

        FWIW: That was how you should have asked the question in the first place.

        Knowing only what 30 seconds of googling has told me about CPE, wouldn't the best way to mimic CPE's behavior be to directly use the Official CPE Database of names?

Re: Finding short DNS names from long text
by mzedeler (Pilgrim) on Jul 12, 2009 at 20:28 UTC

    It sounds like you're asking for the "I'm feeling lucky" button from Google.

    Why don't you grab a useragent module and give it a (w)hack?

      I did think of that as a "first whack" to winnow down a list for a person to review, or to find a substring in the URL that matches a whole word in the long text name.

      That may be the quickest way to get some results.
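A rough sketch of the whole-word check mentioned above, assuming the URL is plain http or https and that a "word" means a run of letters and digits. The name `matching_words` is hypothetical.

```perl
use strict;
use warnings;

# Return the words of the long name that also appear as a whole
# dot-separated label in the URL's host name.
sub matching_words {
    my ($name, $url) = @_;
    my ($host) = $url =~ m{^https?://([^/:]+)}i or return;
    my %label = map { $_ => 1 } split /\./, lc $host;
    return grep { $label{$_} } grep { length } split /[^a-z0-9]+/, lc $name;
}

print join(' ', matching_words('Adobe Systems, Inc.', 'http://www.adobe.com/products/')), "\n";
```

For the example URL this prints "adobe". Labels such as "www" and "com" are not filtered out, so a name that contains one of those as a word would produce a false hit; a person reviewing the list would still need to catch that.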
