Grepping out strings

existem has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I have a little problem and I want the most cool/efficient (but still understandable) way of solving it.

I basically have a long string, for example:

<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver="ss('go to www.mysite.com')" onMouseOut="cs()" class=ch onClick="ga(this,event)"><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver="return ss('go to www.mysite.com')" onMouseOut="cs()">MySite wine merchant</a> www.mysite2.com      Award winning wines at great prices  Money back satisfaction guarantee</td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver="ss('go to www.mysite2.com')" onMouseOut="cs()" onClick="ga(this,event)">Sponsored Links</td></tr>

I want the best way to get all the domain names out of it.

So, for example, from the above I would get http://www.mysite.com and http://www.mysite2.com as the output.

I have tried doing this with the LinkExtor.pm module from CPAN, but I'm not sure if this is really necessary. Can I just do it with a regular expression, reading the string in and storing all occurrences of http://www.somedomain.com?

Thank you, your advice is as ever very much appreciated.

Tom.


Replies are listed 'Best First'.
Re: Grepping out strings by Zaxo (Archbishop) on Feb 02, 2005 at 13:13 UTC
Assuming protocol and all, it's pretty easy with Regexp::Common: `use Regexp::Common qw/URI/; my @results; while ($string =~ /$RE{URI}{-keep}/g) { push @results, $1; }` Your sample seems to omit the protocol and other bits that make a URI. Do you know they are all http? Try looking at the regex from `Regexp::Common::URI` and pick out the host-and-domain part. It will start with two literal slashes and run until the next slash or space, whichever comes first. After Compline, Zaxo
Re: Grepping out strings by g0n (Priest) on Feb 02, 2005 at 13:11 UTC
`$_ =~ /(http:\/\/[\w\.]+)/g;` seems to work OK for URLs starting with http://, but it doesn't match www.mysite2.com, as that has no http prefix in your example string. VGhpcyBtZXNzYWdlIGludGVudGlvbmFsbHkgcG9pbnRsZXNz
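A minimal sketch of that pattern used in list context, so that every match is collected into an array. The `$html` string is a shortened, made-up stand-in for the sample row, and `@urls` is just an illustrative name.

```perl
use strict;
use warnings;

# Shortened stand-in for the sample row (hypothetical data).
my $html = q{<a href=/pagead/iclk?adurl=http://www.mysite.com&sa=l>MySite</a> www.mysite2.com};

# In list context, a /g match returns every captured group.
my @urls = $html =~ m{(http://[\w.]+)}g;

print "$_\n" for @urls;
```

Only the http:// URL is collected here; the bare www.mysite2.com is skipped, which is the limitation described above.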
Re: Grepping out strings by manigandans (Initiate) on Feb 02, 2005 at 13:12 UTC
Hi Tom, try out the following regular expression:
    my $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>      Award winning wines at great prices  Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    # The dot after "www" is escaped so it only matches a literal dot.
    my @output = ($data =~ /www\.\S+\.com/g);
    print join("\n", @output);
Mani. Edited by demerphq -- added code tags and basic markup
Re: Grepping out strings by perlsen (Chaplain) on Feb 02, 2005 at 15:25 UTC
Hi, just try this:
    my $input = q(<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>      Award winning wines at great prices  Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>);
    my @arr = ();
    while ($input =~ m#((http://)*www.+?\.\w+) ?'?#gsi) {
        push(@arr, $1);
    }
    print "$_\n" for @arr;
    # outputs in @arr:
    # www.mysite.com
    # http://www.mysite.com
    # www.mysite.com
    # www.mysite2.com
    # www.mysite2.com
Re: Grepping out strings by manigandans (Initiate) on Feb 02, 2005 at 13:19 UTC
Hi Tom, assume that $data has the long string you've specified and try out the following regexp: `my @output = ($data =~ /www\.\S+\.com/g); print join("\n", @output);` Mani
Re: Grepping out strings by jpk236 (Monk) on Feb 02, 2005 at 14:22 UTC
Tom, this might not be as fancy as some of the other suggestions, but it will work for both domains.
    #!/usr/bin/perl
    $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>      Award winning wines at great prices  Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    # First "=ss('go to ...')" in the string
    ($domain1) = $data =~ /=ss\('go to (\S*)'\).*$/;
    # Last "=ss('go to ...')" in the string (the leading .* is greedy)
    ($domain2) = $data =~ /^.*=ss\('go to (\S*)'\)/;
    $domain1 = "http://".$domain1;
    $domain2 = "http://".$domain2;
    print "Domain1 :: $domain1\n";
    print "Domain2 :: $domain2\n";
- Justin
Re: Grepping out strings by Anonymous Monk on Feb 02, 2005 at 16:23 UTC
If false positives weren't a problem, this would do: `@urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;` That would grab any number of words separated by dots (without spaces), and would catch the protocol as well, if there is one.
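To make the false-positive caveat concrete, here is a small sketch that runs that pattern over a shortened, made-up string. The `logo.gif` token is an assumption added purely to show an unwanted match; it is not in the original sample.

```perl
use strict;
use warnings;

# Shortened stand-in for the sample row, plus a file name (hypothetical data).
my $string = q{href=/x?adurl=http://www.mysite.com&sa=l>MySite</a> www.mysite2.com logo.gif};

# Optional "scheme:" and slashes, then dot-separated words.
my @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;

print "$_\n" for @urls;
```

Both host names come back, with the protocol attached where there is one, but so does `logo.gif`, since any dotted word qualifies.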
Re: Grepping out strings by Popcorn Dave (Abbot) on Feb 02, 2005 at 18:37 UTC
In addition to looking at using a regex, you might also consider taking a look at HTML::TokeParser. If you're trying to parse a whole page, I think that HTML::TokeParser makes it much easier to find your data. Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
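A rough sketch of the HTML::TokeParser route, assuming the module is installed from CPAN. The `$html` string is a shortened, made-up stand-in for the sample row, and the idea that the target URL always sits in an `adurl` parameter is an assumption based on that one sample.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN module, not part of core Perl

# Shortened stand-in for the sample row (hypothetical data).
my $html = q{<tr><td><a id=pa1 href="/pagead/iclk?adurl=http://www.mysite.com&sa=l">MySite</a></td></tr>};

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @urls;
# get_tag('a') skips ahead to each <a> start tag; element 1 of the
# returned array reference is a hash of that tag's attributes.
while (my $tag = $parser->get_tag('a')) {
    my $href = $tag->[1]{href} or next;
    # Assumption: the target URL is carried in an "adurl" parameter.
    push @urls, $1 if $href =~ m{adurl=(\w+://[^&\s]+)};
}

print "$_\n" for @urls;
```

The parser takes care of finding the tag and its attributes, so the regex only has to deal with one attribute value rather than the whole row.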
Re: Grepping out strings by existem (Sexton) on Feb 02, 2005 at 15:07 UTC
Thanks, I've kind of used a combination of just about every approach suggested, and I think it has put me on the right track now. I also have to contend with .co.uk addresses and any other kind of domain, as well as protocols other than http, so I think it's going to be a bit of a hack to get this to work exactly as I want it to, but thanks for the help so far. I don't suppose anybody knows which module will give me just the domain out of a URL? For example, if I have `http://www.majestic.co.uk/webapp/wcs/stores/servlet/ReferrerEntryPoint%3FaffiliateId%3D1206%26redirect%3DContentView` I just want to get http://www.majestic.co.uk/. But it could also be in the form `http://www.chilloutdrink.com&sa=l&ai=B1O33J-sAQtOwG7vORJGm4OQDg` So I want to use a module that will take all the different permutations of URLs into account. I know in PHP you can use basename(); any suggestions as to a Perl solution? I just can't quite remember the module! Thanks, Tom
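For well-formed URLs, the CPAN URI module and its `host` method are the usual tool, but the second sample above glues its parameters straight onto the host name, so a module that expects valid URLs may not split it the way you want. As a fallback, here is a minimal core-Perl sketch; `site_root` is a hypothetical helper name, and limiting the host to letters, digits, dots and hyphens is an assumption that ignores ports, user info and internationalised names.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "scheme://host/", or nothing if no match.
sub site_root {
    my ($url) = @_;
    # Scheme, "://", then a host limited to letters, digits, dots and
    # hyphens, so the match stops at "/", "&", "?", ":" or "%".
    return unless $url =~ m{^([a-z][a-z0-9+.\-]*)://([a-z0-9.\-]+)}i;
    return lc("$1://$2") . '/';
}

my @samples = (
    'http://www.majestic.co.uk/webapp/wcs/stores/servlet/ReferrerEntryPoint%3FaffiliateId%3D1206%26redirect%3DContentView',
    'http://www.chilloutdrink.com&sa=l&ai=B1O33J-sAQtOwG7vORJGm4OQDg',
);

print site_root($_), "\n" for @samples;
```

This prints http://www.majestic.co.uk/ and http://www.chilloutdrink.com/ for the two samples, and the scheme part is not tied to http.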
