PerlMonks  

(Ovid) Re: Searching for web sites

by Ovid (Cardinal)
on Oct 24, 2000 at 21:30 UTC


in reply to Searching for web sites

You may wish to check out the HTML::FromText module. It will, amongst other things, automatically convert URLs to hyperlinks. I've never worked with .plan files, so I can't say for certain whether this is an appropriate solution, but I suspect that it's a good place to start.

Also, if you wish to do it by hand, switching to a different delimiter on your regexes will help you avoid backslashitis. Further, if your URLs are not broken across lines (i.e., they have no embedded newlines) and contain no spaces, you could try the following (untested) regex as a starting point for conversion:

$newline =~ s#(http://[^.]+\.[^.]+\S+)#<a href="$1">$1</a>#gi;

The above regex assumes that, at minimum, you will have two groups of characters separated by a period after the http:// portion. The negated character classes match anything that is not a period, whitespace included, so they should actually be replaced by classes that state allowable characters (and if you really want to be anal, I recall that the first allowable character in a domain is different from other allowable characters, but sometimes I get into regex overkill).
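To make the "state allowable characters" idea concrete, here is a minimal sketch rather than a definitive implementation. The helper name linkify_http and the exact character classes are assumptions for illustration: each host label starts with a letter or digit and continues with letters, digits, or hyphens, and port numbers are deliberately not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): wrap http:// URLs
# in anchor tags using explicit character classes.
sub linkify_http {
    my ($line) = @_;

    # Assumed label rule: first character is a letter or digit,
    # the rest are letters, digits, or hyphens.
    my $label = qr/[a-z0-9][a-z0-9-]*/i;

    # Host = at least two labels joined by periods, followed by an
    # optional path, query, or fragment made of non-space characters.
    $line =~ s{(http://$label(?:\.$label)+(?:[/?#]\S*)?)}{<a href="$1">$1</a>}gi;

    return $line;
}

print linkify_http("See http://www.perlmonks.org/index.pl for details"), "\n";
```

Because the host part can only end on a label character, a sentence-ending period right after a bare host name stays outside the link.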

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

Replies are listed 'Best First'.
RE: (Ovid) Re: Searching for web sites
by electronicMacks (Beadle) on Oct 25, 2000 at 03:53 UTC
    If you’re using such a thorough regex that checks for dots and allowable characters, you may wish to ditch the http:// completely. People are more likely to list websites in their .plan files without it (for example, "I visit perlmonks.org" rather than "I visit http://www.perlmonks.org"). Personally I’d feel safe putting anchor tags around anything that looks like xxx.xxx, although you could also include a list of allowable Top Level Domains, something like @TLDs = ("com","net","org","edu","us","nl","de","it","se","ch","uk","ca","hr","ae","br","jp","be","us","au","ie","ar","fi","mil","gov","sg","es","mx","no","pt","dk","il","ru","nz","th","pl","id","cy","in","kw","at","za","cn","fr","is","ro","kr","gr","co","ph","bo","hu","cr","pe","cl","tr","arpa","tw","eg","ee","ge","ua","om","ec","hk","ve","ag","cz","ni","to","nu","sm","ni","lt","yu","bg","ba","do","qa","ck","mt","bf","lu","su","bh");
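One way the TLD list could be turned into a pattern is sketched below. The list is abbreviated for illustration, and the helper name linkify_bare and the choice to prepend http:// in the href are assumptions, not something the post specifies. As written it would also link the domain part of an e-mail address and would re-wrap URLs that are already linked, so treat it as a starting point.

```perl
use strict;
use warnings;

# Abbreviated list for illustration only; substitute the full @TLDs.
my @TLDs = qw(com net org edu gov mil uk de);

# Build an alternation such as com|net|org|...
my $tld_re = join '|', map { quotemeta($_) } @TLDs;

# Hypothetical helper: link bare host names that end in a known TLD.
sub linkify_bare {
    my ($text) = @_;

    # One or more "label." groups followed by a listed TLD, bounded
    # by word boundaries so "organic" does not count as "org".
    $text =~ s{\b((?:[a-z0-9][a-z0-9-]*\.)+(?:$tld_re))\b}{<a href="http://$1">$1</a>}gi;

    return $text;
}

print linkify_bare("I visit perlmonks.org daily"), "\n";
```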

      Isn't this a little dangerous? Any time new TLDs are added you will need to go and change the list, and I cannot see .cx, home of a bunch of free software projects, in this list.

      http:// or at least www(\..+)+\.\w+ seems the safest match.
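A rough sketch of the prefix-based approach follows, assuming a match must start with either http:// or www. and that no TLD list is consulted. The helper name linkify_prefixed is hypothetical. The .+ from the pattern above is swapped for an explicit character class here, since .+ would also run across spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: link anything that starts with http:// or www.
sub linkify_prefixed {
    my ($text) = @_;

    $text =~ s{\b((?:http://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?)}{
        my $shown = $1;
        # Assumption: prepend http:// when the text only began with www.
        my $href = index(lc $shown, 'http://') == 0 ? $shown : "http://$shown";
        '<a href="' . $href . '">' . $shown . '</a>';
    }gie;

    return $text;
}

print linkify_prefixed("try www.perlmonks.org now"), "\n";
```

This needs no list maintenance when new TLDs appear, at the cost of missing bare names such as perlmonks.org.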

        Let's not forget either that InterNIC just released the .god domain.
