http://qs321.pair.com?node_id=427227

existem has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I have a little problem and I want the most cool/efficient (but still understandable) way of solving it.

I basically have a long string, for example:

<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver="ss('go to www.mysite.com')" onMouseOut="cs()" class=ch onClick="ga(this,event)"><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver="return ss('go to www.mysite.com')" onMouseOut="cs()"><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver="ss('go to www.mysite2.com')" onMouseOut="cs()" onClick="ga(this,event)"><font size=-1 class=f>Sponsored Links</font></td></tr>

I want the best way to get all the domain names out of it.

So for example, from the above I would get http://www.mysite.com and http://www.mysite2.com as the output.

I have tried doing this with the LinkExtor.pm module from CPAN, but I'm not sure if this is really necessary. Can I just do it with a regular expression, reading the string in and storing all occurrences of http://www.somedomain.com?

Thank you, your advice is as ever very much appreciated.

Tom.

Re: Grepping out strings
by Zaxo (Archbishop) on Feb 02, 2005 at 13:13 UTC

    Assuming protocol and all, it's pretty easy with Regexp::Common:

    use Regexp::Common qw/URI/;

    my @results;
    while ($string =~ /$RE{URI}{-keep}/g) {
        push @results, $1;
    }

    Your sample seems to omit the protocol and other bits that make a URI. Do you know they are all http? Try looking at the regex from R::C::URI and pick out the host-and-domain part. It will start with two literal slashes and run until the next slash or space, whichever comes first.
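
    The host-and-domain idea above could be sketched roughly as follows. This is only an illustration, not the pattern from R::C::URI: the input string is a hypothetical fragment shaped like the sample, and the character class is an assumption that is narrower than "until the next slash or space", so the match also stops at the `&` that follows the host in the sample.

```perl
use strict;
use warnings;

# Hypothetical input: a fragment shaped like the sample in the question.
my $string = q{href=/pagead/iclk?adurl=http://www.mysite.com&sa=l <a href="ftp://ftp.example.org/pub/">};

# Two literal slashes, then capture the run of hostname characters.
# The class [A-Za-z0-9.-] is an assumption: it is narrower than
# "until the next slash or space", so the match also stops at '&'.
my @hosts = $string =~ m{//([A-Za-z0-9.-]+)}g;

print "$_\n" for @hosts;
```

    Because nothing here checks the protocol, any `//` followed by hostname characters will match, which may or may not be what you want.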

    After Compline,
    Zaxo

Re: Grepping out strings
by g0n (Priest) on Feb 02, 2005 at 13:11 UTC
    $_ =~ /(http:\/\/[\w\.]+)/g;

    seems to work OK for URLs starting http://, but that doesn't match www.mysite2.com as that doesn't have a http prefix in your example string.
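
    To actually collect the matches, the pattern can be used in list context. A minimal sketch, assuming a cut-down version of the sample string; the second pattern, with an optional http:// prefix in front of a literal www., is an assumption added to cover the bare www.mysite2.com case:

```perl
use strict;
use warnings;

# Hypothetical input, cut down from the sample in the question.
my $string = q{adurl=http://www.mysite.com&sa=l <span class=a>www.mysite2.com</span>};

# In list context, m//g returns every captured substring.
my @with_prefix = $string =~ m{(http://[\w.]+)}g;

# Assumption: making the prefix optional also picks up bare www. names.
my @any = $string =~ m{((?:http://)?www\.[\w.]+)}g;

print "with prefix: @with_prefix\n";
print "either form: @any\n";
```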

    VGhpcyBtZXNzYWdlIGludGVudGlvbmFsbHkgcG9pbnRsZXNz
Re: Grepping out strings
by manigandans (Initiate) on Feb 02, 2005 at 13:12 UTC

    Hi Tom,

    Try out the following regular expression:

    my $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    my @output = ($data =~ /www.\S+\.com/g);
    print join ("\n", @output);

    Mani.

    Edited by demerphq -- added code tags and basic markup
Re: Grepping out strings
by perlsen (Chaplain) on Feb 02, 2005 at 15:25 UTC

    Hi, just try this

    my $input = q(<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>);
    my @arr = ();
    while ($input =~ m#((http://)*www.+?\.\w+) ?'?#gsi) {
        push(@arr, $1);
    }
    print "$_\n" for @arr;

    # outputs in @arr:
    # www.mysite.com
    # http://www.mysite.com
    # www.mysite.com
    # www.mysite2.com
    # www.mysite2.com
Re: Grepping out strings
by manigandans (Initiate) on Feb 02, 2005 at 13:19 UTC
    Hi Tom,

    Assume that $data has the long string you've specified and try out the following regexp:

    my @output = ($data =~ /www.\S+\.com/g);

    print join ("\n", @output);

    Mani
Re: Grepping out strings
by jpk236 (Monk) on Feb 02, 2005 at 14:22 UTC
    Tom, this might not be as fancy as some of the other suggestions, but it will work for both domains.
    #!/usr/bin/perl
    $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    ($domain1) = $data =~ /=ss\('go to (\S*)'\).*$/;
    ($domain2) = $data =~ /^.*=ss\('go to (\S*)'\)/;
    $domain1 = "http://".$domain1;
    $domain2 = "http://".$domain2;
    print "Domain1 :: $domain1\n";
    print "Domain2 :: $domain2\n";
    - Justin
Re: Grepping out strings
by Popcorn Dave (Abbot) on Feb 02, 2005 at 18:37 UTC
    In addition to looking at using a regex, you might also consider taking a look at HTML::TokeParser. If you're trying to parse a whole page, I think that HTML::TokeParser makes it much easier to find your data.
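
    A rough sketch of the HTML::TokeParser route, assuming the module is installed from CPAN (it is not part of core Perl). The HTML fragment is a hypothetical cut-down version of the sample with the href quoted, and the two small patterns used to pick hosts out of the href and the text are assumptions for illustration only:

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN module, assumed to be installed

# Hypothetical fragment, cut down from the sample in the question.
my $html = '<td><a id=pa1 href="/pagead/iclk?adurl=http://www.mysite.com&sa=l"><b>MySite</b> wine merchant</a><span class=a>www.mysite2.com</span></td>';

my $parser = HTML::TokeParser->new(\$html) or die "cannot set up parser";

my @found;
while (my $token = $parser->get_token) {
    if ($token->[0] eq 'S' && $token->[1] eq 'a') {
        # Start tag <a>: look for a URL embedded in the href attribute.
        my $href = $token->[2]{href};
        push @found, $1 if defined $href && $href =~ m{(http://[\w.-]+)};
    }
    elsif ($token->[0] eq 'T') {
        # Text token: pick up bare www. names shown to the reader.
        push @found, $token->[1] =~ m{\b(www\.[\w.-]+)}g;
    }
}

print "$_\n" for @found;
```

    The parser takes care of separating tags, attributes and text, so the regexes only ever see one attribute value or one run of text at a time.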

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Grepping out strings
by existem (Sexton) on Feb 02, 2005 at 15:07 UTC
    Thanks, I've kind of used a combination of just about every approach suggested and I think it has put me on the right track now.

    I also have to contend with .co.uk addresses, other kinds of domain, and protocols other than http, so I think it's going to be a bit of a hack to get this to work exactly as I want it to, but thanks for the help so far.

    I don't suppose anybody knows which module has the effect of giving me just the domain out of a URL?

    So for example, if I have:

    http://www.majestic.co.uk/webapp/wcs/stores/servlet/ReferrerEntryPoint%3FaffiliateId%3D1206%26redirect%3DContentView

    I just want to get http://www.majestic.co.uk/.

    But it could also be in the form:

    http://www.chilloutdrink.com&sa=l&ai=B1O33J-sAQtOwG7vORJGm4OQDg

    So I want to use a module that will take all different permutations of URLs into account. I know in PHP you can use basename(), any suggestions as to a Perl solution? Just can't quite remember the module!

    Thanks, Tom
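
    For the two URLs quoted above, a plain regex may already be enough. This is a minimal sketch rather than a module recommendation: `site_root` is a hypothetical helper name, and the assumption is that the host consists only of letters, digits, dots and hyphens, so anything after it (a path, a `&`, even a port number) is dropped.

```perl
use strict;
use warnings;

# The two URLs quoted in the post above.
my @urls = (
    'http://www.majestic.co.uk/webapp/wcs/stores/servlet/ReferrerEntryPoint%3FaffiliateId%3D1206%26redirect%3DContentView',
    'http://www.chilloutdrink.com&sa=l&ai=B1O33J-sAQtOwG7vORJGm4OQDg',
);

# Hypothetical helper: keep the scheme and host, drop everything else.
sub site_root {
    my ($url) = @_;
    return unless $url =~ m{^([A-Za-z][A-Za-z0-9+.-]*://[A-Za-z0-9.-]+)};
    return "$1/";
}

print site_root($_), "\n" for @urls;
```

    The scheme part is not tied to http, so other protocols written as `scheme://host` go through the same helper.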

Re: Grepping out strings
by Anonymous Monk on Feb 02, 2005 at 16:23 UTC
    If false positives weren't a problem, this would do:
    @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;
    That would grab any number of words separated by dots (without spaces), and would catch the protocol as well, if there is one.
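
    A small demonstration of what "false positives" means here. The input string is made up for illustration; the pattern is the one given above, unchanged:

```perl
use strict;
use warnings;

# Hypothetical input with two real hosts and one decoy.
my $string = 'see http://www.mysite.com and www.mysite2.com, version 1.5 too';

# List context: every match of the whole pattern is returned.
my @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;

print "$_\n" for @urls;
# prints:
# http://www.mysite.com
# www.mysite2.com
# 1.5
```

    The version number is picked up because it is also "words separated by dots", so some filtering afterwards would be needed on real pages.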