
Grepping out strings

by existem (Sexton)
on Feb 02, 2005 at 12:50 UTC ( #427227=perlquestion )

existem has asked for the wisdom of the Perl Monks concerning the following question:

I have a little problem and I want the most cool/efficient (but still understandable) way of solving it.

I basically have a long string, for example:

<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver="ss('go to')" onMouseOut="cs()" class=ch onClick="ga(this,event)"><a id=pa1 href=/pagead/iclk?adurl= onMouseOver="return ss('go to')" onMouseOut="cs()"><b>MySite</b> wine merchant</a><font size=-1><br><span class=a></span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver="ss('go to')" onMouseOut="cs()" onClick="ga(this,event)"><font size=-1 class=f>Sponsored Links</font></td></tr>

I want the best way to get all the domain names out of it.

So for example in the above I would get, and as the output.

I have tried doing this with the module from CPAN, but I'm not sure if this is really necessary. Can I just do it with a regular expression? Reading the string in and storing all occurrences of ?

Thank you, your advice is as ever very much appreciated.


Replies are listed 'Best First'.
Re: Grepping out strings
by Zaxo (Archbishop) on Feb 02, 2005 at 13:13 UTC

    Assuming protocol and all, it's pretty easy with Regexp::Common:

    use Regexp::Common qw/URI/;
    my @results;
    while ($string =~ /$RE{URI}{-keep}/g) {
        push @results, $1;
    }
    Your sample seems to omit the protocol and other bits that make a URI. Do you know they are all http? Try looking at the regex from R::C::URI and pick out the host-and-domain part. It will start with two literal slashes and run until the next slash or space, whichever comes first.
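
    The host-and-domain rule described above (two literal slashes, then everything up to the next slash or space) can be sketched with a plain regex. This is only a minimal illustration, not the actual Regexp::Common pattern; the input string and its example.com / example.org hosts are made-up placeholders, not data from the thread.

```perl
use strict;
use warnings;

# Hypothetical input; the URLs are placeholders, not taken from the thread.
my $string = 'see http://www.example.com/wine?id=1 and ftp://ftp.example.org for details';

# Two literal slashes, then capture everything up to the next slash
# or whitespace character, whichever comes first.
my @hosts = $string =~ m{//([^/\s]+)}g;

print "$_\n" for @hosts;
```

    Note that this only finds hosts that follow a protocol's "//", so bare names such as those in the sample string would still be missed.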

    After Compline,

Re: Grepping out strings
by g0n (Priest) on Feb 02, 2005 at 13:11 UTC
    $_ =~ /(http:\/\/[\w\.]+)/g;

    seems to work OK for URLs starting with http://, but it doesn't match the domains in your example string, as they don't have an http prefix.

Re: Grepping out strings
by manigandans (Initiate) on Feb 02, 2005 at 13:12 UTC

    Hi Tom,

    Try out the following regular expression:

    my $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl= eRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite2.com</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    # Escape the dot after "www" so it matches a literal "." only
    my @output = ($data =~ /www\.\S+\.com/g);
    print join ("\n", @output);


    Edited by demerphq -- added code tags and basic markup
Re: Grepping out strings
by perlsen (Chaplain) on Feb 02, 2005 at 15:25 UTC

    Hi, just try this

    my $input = q(<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl= eRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgugAZDsuf8DyAEB onMouseOver=return ss('go to') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a>www.mysite</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>);
    my @arr = ();
    while ($input =~ m#((http://)*www.+?\.\w+) ?'?#gsi) {
        push(@arr, $1);
    }
    print "$_\n" for @arr;
    # outputs in @arr:
Re: Grepping out strings
by manigandans (Initiate) on Feb 02, 2005 at 13:19 UTC
    Hi Tom,

    Assume that $data has the long string you've specified and try out the following regexp:

    my @output = ($data =~ /www\.\S+\.com/g);

    print join ("\n", @output);

Re: Grepping out strings
by jpk236 (Monk) on Feb 02, 2005 at 14:22 UTC
    Tom, This might not be as fancy as some of the other suggestions, but it will work for both domains.
    #!/usr/bin/perl
    $data = "<tr><td id=tpa1 nowrap bgcolor=#e5ecf9 height=40 onMouseOver=ss('go to') onMouseOut=cs() class=ch onClick=ga(this,event)><a id=pa1 href=/pagead/iclk?adurl= =BRqXfIssAQqj8EcfeRIXF7OQD2sdO4pmhjQGsqeEKgPEECAAQARgBKAI4AECKFkifOZgBgkugAZDsuf8DyAEB onMouseOver=return ss('go to') onMouseOut=cs()><b>MySite</b> wine merchant</a><font size=-1><br><span class=a></span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Award winning wines at great prices&nbsp;&nbsp;Money back satisfaction guarantee</font></td><td id=spa1 class=ch bgcolor=#e5ecf9 height=40 align=right valign=top onMouseOver=ss('go to') onMouseOut=cs() onClick=ga(this,event)><font size=-1 class=f>Sponsored Links</font></td></tr>";
    ($domain1) = $data =~ /=ss\('go to (\S*)'\).*$/;
    ($domain2) = $data =~ /^.*=ss\('go to (\S*)'\)/;
    $domain1 = "http://".$domain1;
    $domain2 = "http://".$domain2;
    print "Domain1 :: $domain1\n";
    print "Domain2 :: $domain2\n";
    - Justin
Re: Grepping out strings
by Anonymous Monk on Feb 02, 2005 at 16:23 UTC
    If false positives weren't a problem, this would do:
    @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;
    That would grab any number of words separated by dots (without spaces), and would catch the protocol as well, if there is one.
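
    To make the false-positive caveat concrete, here is a small sketch that runs the same pattern over a made-up sentence. The input text and the example.com / example.org names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input; example.com and example.org are placeholders.
my $string = 'Visit www.example.com or http://shop.example.org/wines today. Version 1.5 ships soon.';

# Optional "protocol:" followed by any number of slashes, then one or
# more dot-terminated labels, then a final label.
my @urls = $string =~ m!(?:\w+:/*)?(?:[\w\-]+\.)+\w+!g;

print "$_\n" for @urls;
```

    Besides the two hosts, the version number "1.5" is also returned, which is the kind of false positive mentioned above. The path "/wines" is dropped because the pattern stops at the last dotted label.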
Re: Grepping out strings
by Popcorn Dave (Abbot) on Feb 02, 2005 at 18:37 UTC
    In addition to looking at using a regex, you might also consider taking a look at HTML::TokeParser. If you're trying to parse a whole page, I think that HTML::TokeParser makes it much easier to find your data.
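
    A rough sketch of that approach, assuming HTML::TokeParser (part of the HTML-Parser distribution) is installed. The HTML fragment, the example.com host and the variable names are invented for illustration; only the idea of reading link targets and span text comes from the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical fragment, loosely shaped like the sample row in the question.
my $html = '<tr><td><a href="/pagead/iclk?adurl=http://www.example.com/">Example wine merchant</a>'
         . '<span class="a">www.example.com</span></td></tr>';

my $parser = HTML::TokeParser->new(\$html) or die "Cannot parse: $!";

my (@hrefs, @span_text);

# get_tag() skips ahead to the next <a> or <span> start tag.
while (my $token = $parser->get_tag('a', 'span')) {
    my ($tag, $attr) = @$token;
    if ($tag eq 'a') {
        push @hrefs, $attr->{href} if defined $attr->{href};
    }
    else {
        # Collect the text between <span> and </span>.
        push @span_text, $parser->get_trimmed_text('/span');
    }
}

print "href: $_\n" for @hrefs;
print "span: $_\n" for @span_text;
```

    The collected strings would still need a regex or a URL parser to reduce them to bare domain names; the parser only takes care of finding the right pieces of the page.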

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Grepping out strings
by existem (Sexton) on Feb 02, 2005 at 15:07 UTC
    Thanks, I've used a combination of just about every approach suggested, and I think it has put me on the right track now.

    I also have to contend with address and any other kind of domain and protocol other than http, so I think it's going to be a bit of a hack to get this working exactly as I want, but thanks for the help thus far.

    I don't suppose anybody knows which module has the effect of giving me just the domain out of a URL?

    So for example if I have.

    I just want to get

    But it could also be in the form.

    So I want to use a module that will take all different permutations of URLs into account. I know in PHP you can use basename(), any suggestions as to a Perl solution? Just can't quite remember the module!

    Thanks, Tom
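
    One module that may fit, though nobody in the thread names it, is URI from CPAN: its host method returns just the host part of a URL. The sketch below assumes URI is installed, and the URLs are placeholders rather than the ones from the question.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URLs covering a few protocols; none come from the thread.
my @urls = (
    'http://www.example.com/shop/wines.html',
    'https://example.org:8080/index.cgi?page=2',
    'ftp://ftp.example.net/pub/',
);

my @hosts;
for my $url (@urls) {
    my $uri = URI->new($url);

    # Not every URI scheme has a host part, so check before calling host().
    my $host = $uri->can('host') ? $uri->host : undef;
    push @hosts, $host if defined $host;
}

print "$_\n" for @hosts;
```

    Strings without a protocol (a bare "www..." name, say) are not parsed as having a host, so they would need a prefix added or a regex fallback like the ones suggested in the replies.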

Node Type: perlquestion [id://427227]
Approved by pelagic