PerlMonks  

Grepping out strings

by existem (Sexton)
on Feb 02, 2005 at 12:50 UTC (#427227=perlquestion)

existem has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I have a little problem and I want the most cool/efficient (but still understandable) way of solving it.

I basically have a long string, for example:

<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver="ss('go to www.mysite.com')" onMouseOut="cs()" class=ch onClick="ga(this,event)"><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver="return ss('go to www.mysite.com')" onMouseOut="cs()"><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver="ss('go to www.mysite2.com')" onMouseOut="cs()" onClick="ga(this,event)"><font size=-1 class=f>Sponsored Links</font></td></tr>

I want the best way to get all the domain names out of it.

So for example, from the above I would get http://www.mysite.com and http://www.mysite2.com as the output.

I have tried doing this with the LinkExtor.pm module from CPAN, but I'm not sure if this is really necessary. Can I just do it with a regular expression, reading the string in and storing all occurrences of http://www.somedomain.com?

Thank you, your advice is as ever very much appreciated.

Tom.

Replies are listed 'Best First'.
Re: Grepping out strings
by Zaxo (Archbishop) on Feb 02, 2005 at 13:13 UTC

    Assuming protocol and all, it's pretty easy with Regexp::Common:

    use Regexp::Common qw/URI/;
    my @results;
    while ($string =~ /$RE{URI}{-keep}/g) {
        push @results, $1;
    }

    Your sample seems to omit the protocol and other bits that make a URI. Do you know they are all http? Try looking at the regex from R::C::URI and pick out the host-and-domain part. It will start with two literal slashes and run until the next slash or space, whichever comes first.
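
    The host-and-domain idea described above can be sketched with core Perl alone. This is only an illustration under stated assumptions: the helper name hosts_after_slashes and the sample string are made up for the example, and the pattern is a hand-written approximation, not the regex that Regexp::Common actually uses.

```perl
use strict;
use warnings;

# Hypothetical helper: capture whatever follows each "//" up to the
# next slash or whitespace character.
sub hosts_after_slashes {
    my ($text) = @_;
    my @hosts = $text =~ m{//([^/\s]+)}g;   # list context returns every capture
    return @hosts;
}

# Made-up sample text, not the string from the question.
my $sample = q{<a href="http://www.example.com/index.html">home</a> see ftp://ftp.example.org/pub};
print "$_\n" for hosts_after_slashes($sample);
```

    Note that in the string from the question the host is followed by "&sa=l..." rather than a slash, so this pattern would also pick up that tail until the next space. A stricter character class would probably be needed there.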

    After Compline,
    Zaxo

Re: Grepping out strings
by g0n (Priest) on Feb 02, 2005 at 13:11 UTC
    my @urls = $_ =~ /(http:\/\/[\w\.]+)/g;   # list context collects every match

    seems to work OK for URLs starting with http://, but it doesn't match www.mysite2.com, as that has no http prefix in your example string.

    VGhpcyBtZXNzYWdlIGludGVudGlvbmFsbHkgcG9pbnRsZXNz
Re: Grepping out strings
by manigandans (Initiate) on Feb 02, 2005 at 13:12 UTC

    Hi Tom,

    Try out the following regular expression:

    my $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    my @output = ($data =~ /www.\S+\.com/g);
    print join ("\n", @output);

    Mani.

    Edited by demerphq -- added code tags and basic markup
Re: Grepping out strings
by perlsen (Chaplain) on Feb 02, 2005 at 15:25 UTC

    Hi, just try this

    my $input = q(<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>);
    my @arr = ();
    while ($input =~ m#((http://)*www.+?\.\w+) ?'?#gsi) {
        push(@arr, $1);
    }
    print "$_\n" for @arr;
    # outputs in @arr:
    #   www.mysite.com
    #   http://www.mysite.com
    #   www.mysite.com
    #   www.mysite2.com
    #   www.mysite2.com
Re: Grepping out strings
by manigandans (Initiate) on Feb 02, 2005 at 13:19 UTC
    Hi Tom,

    Assume that $data has the long string you've specified and try out the following regexp:

    my @output = ($data =~ /www.\S+\.com/g);

    print join ("\n", @output);

    Mani
Re: Grepping out strings
by jpk236 (Monk) on Feb 02, 2005 at 14:22 UTC
    Tom, This might not be as fancy as some of the other suggestions, but it will work for both domains.
    #!/usr/bin/perl
    $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to www.mysite.com') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl=http://www.mysite.com&sa=l&ai=BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver=return ss('go to www.mysite.com') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to www.mysite2.com') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    # The first "=ss('go to ...')" in the string gives the first domain.
    ($domain1) = $data =~ /=ss\('go to (\S*)'\).*$/;
    # The greedy ^.* skips ahead to the last "=ss('go to ...')".
    ($domain2) = $data =~ /^.*=ss\('go to (\S*)'\)/;
    $domain1 = "http://".$domain1;
    $domain2 = "http://".$domain2;
    print "Domain1 :: $domain1\n";
    print "Domain2 :: $domain2\n";
    - Justin
Re: Grepping out strings
by Anonymous Monk on Feb 02, 2005 at 16:23 UTC
    If false positives weren't a problem, this would do:
    @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;
    That would grab any number of words separated by dots (without spaces), and would catch the protocol as well, if there is one.
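
    To make the false-positive caveat concrete, here is a small sketch that runs the same pattern over a made-up sentence. The helper name candidate_urls and the sample text are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern quoted above.
sub candidate_urls {
    my ($string) = @_;
    # No capture groups, so list context returns each whole match.
    my @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;
    return @urls;
}

# Made-up sample containing two hosts and two things that are not hosts.
my $text  = q{go to www.mysite.com or http://www.mysite2.co.uk/x, version 1.5 of index.html};
my @found = candidate_urls($text);
print "$_\n" for @found;
```

    Besides the two hosts, the pattern also returns "1.5" and "index.html", so some filtering (for example against a list of known top-level domains) would likely be needed on real pages.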
Re: Grepping out strings
by Popcorn Dave (Abbot) on Feb 02, 2005 at 18:37 UTC
    In addition to looking at using a regex, you might also consider taking a look at HTML::TokeParser. If you're trying to parse a whole page, I think that HTML::TokeParser makes it much easier to find your data.
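
    As a rough sketch of that approach, the following pulls the href of every anchor tag out of an HTML string with HTML::TokeParser (a CPAN module from the HTML-Parser distribution, so it has to be installed). The helper name link_targets and the sample markup are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN, part of the HTML-Parser distribution

# Hypothetical helper: return the href attribute of every <a> start tag.
sub link_targets {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser for HTML string";
    my @hrefs;
    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # an array reference whose second element is the attribute hash.
    while (my $tag = $parser->get_tag('a')) {
        my $attr = $tag->[1];
        push @hrefs, $attr->{href} if defined $attr->{href};
    }
    return @hrefs;
}

# Made-up sample markup; the second anchor has no href and is skipped.
my $html = q{<p><a href="http://www.example.com/wine">one</a> <a name="x">two</a> <a href="/pagead/iclk?adurl=http://www.example.org">three</a></p>};
print "$_\n" for link_targets($html);
```

    Each returned href could then be run through one of the regexes from this thread, which keeps the pattern away from unrelated attributes such as onMouseOver.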

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Grepping out strings
by existem (Sexton) on Feb 02, 2005 at 15:07 UTC
    Thanks, I've kind of used a combination of just about every approach suggested and I think it has put me on the right track now.

    I also have to contend with .co.uk addresses, any other kind of domain, and protocols other than http, so I think it's going to be a bit of a hack to get this to work exactly as I want it to, but thanks for the help so far.

    I don't suppose anybody knows which module has the effect of giving me just the domain out of a URL?

    So for example, if I have:

    http://www.majestic.co.uk/webapp/wcs/stores/servlet/ReferrerEntryPoint%3FaffiliateId%3D1206%26redirect%3DContentView

    I just want to get http://www.majestic.co.uk/.

    But it could also be in the form:

    http://www.chilloutdrink.com&sa=l&ai=B1O33J-sAQtOwG7vORJGm4OQDg

    So I want to use a module that will take all the different permutations of URLs into account. I know in PHP you can use basename(); any suggestions as to a Perl solution? I just can't quite remember the module!

    Thanks, Tom
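
    For the two URL shapes shown above, one core-only way to cut a URL down to its scheme and host might look like the sketch below. The helper name site_root is hypothetical, and the pattern assumes plain host names made of letters, digits, dots and hyphens (no ports, user names or IP literals). The CPAN URI module offers a host method for the general case, though as far as I know it would not trim the "&sa=..." tail of the second form by itself.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL to "scheme://host/".
# Assumes a simple host name; anything after the host is dropped,
# whether it starts with "/" or with a stray "&".
sub site_root {
    my ($url) = @_;
    return unless $url =~ m{^([a-z][a-z0-9+.\-]*)://([a-z0-9\-]+(?:\.[a-z0-9\-]+)+)}i;
    return lc("$1://$2/");
}

print site_root('http://www.majestic.co.uk/webapp/wcs/stores/servlet/ReferrerEntryPoint%3FaffiliateId%3D1206%26redirect%3DContentView'), "\n";
print site_root('http://www.chilloutdrink.com&sa=l&ai=B1O33J-sAQtOwG7vORJGm4OQDg'), "\n";
```

    Because the host part only accepts host-name characters, the same pattern copes with .co.uk style domains and with schemes other than http.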

Node Type: perlquestion [id://427227]
Approved by pelagic