PerlMonks  

Linkifying URLs in plain text

by grinder (Bishop)
on Nov 21, 2008 at 15:35 UTC ( #725157=perlquestion )

grinder has asked for the wisdom of the Perl Monks concerning the following question:

It's Friday afternoon and my brain is fried.

I have a boatload of plain text that contains things that look like http links. If I find one, then I want to wrap it in HTML anchors.

I don't think there's a module that does this, but I welcome suggestions. Otherwise, the spec is in the test below. If someone could show me the error of my ways I'd be a happy man.

Bonus points for converting the first link (http://www.example.com) to http://www.example.com/ in the HREF (add the omitted slash).

    use strict;
    use warnings;

    my $html = <<END_HTML;
    Blah blah blah
    Web: http://www.example.com
    info.example.com
    Web: http://info.example.com/
    And another Web site: http://another.example.com/ (doesn't always work)
    Nice blog here: http://blog.example.com/niceblog/
    END_HTML

    my $target = <<END_HTML;
    Blah blah blah
    Web: <a href="http://www.example.com">www.example.com</a>
    info.example.com
    Web: <a href="http://info.example.com/">info.example.com</a>
    And another Web site: <a href="http://another.example.com/">another.example.com</a> (doesn't always work)
    Nice blog here: <a href="http://blog.example.com/niceblog/">blog.example.com</a>
    END_HTML

    use Test::More tests => 1;

    $html =~ s{(http://(\S+)(?:/\S*)?)}{<a href="$1">$2</a>}g;

    is($html, $target);

• another intruder with the mooring in the heart of the Perl
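
One possible reading of why the test above fails: `\S+` is greedy, so it swallows the path along with the host, and `$2` ends up holding both. Below is a minimal sketch of a narrower pattern, assuming that only `http://` links matter and that the host ends at the first slash or whitespace. The sample text is shortened for illustration and is not the full test data from the question.

```perl
use strict;
use warnings;

my $text = "Web: http://www.example.com\n"
         . "Nice blog here: http://blog.example.com/niceblog/\n";

# $1 is the whole URL (used for the href); $2 is the host only,
# because [^/\s]+ stops at the first slash or whitespace.
(my $html = $text) =~ s{(http://([^/\s]+)(?:/\S*)?)}{<a href="$1">$2</a>}g;

print $html;
```

This sketch leaves the href as written, so it does not add the omitted slash, and it makes no attempt to handle punctuation that trails a link.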

Replies are listed 'Best First'.
Re: Linkifying URLs in plain text
by jeffa (Bishop) on Nov 21, 2008 at 16:28 UTC

    This seems to work. :)

        use strict;
        use warnings;
        use Data::Dumper;
        use URI::Find;

        my $data = do { local $/; <DATA> };

        my $finder = URI::Find->new(
            sub {
                my ($uri, $orig_uri) = @_;
                return qq|<a href="$uri">$orig_uri</a>|;
            }
        );
        $finder->find( \$data );

        print $data;

        __DATA__
        Blah blah blah
        Web: http://www.example.com
        info.example.com
        Web: http://info.example.com/
        And another Web site: http://another.example.com/ (doesn't always work)
        Nice blog here: http://blog.example.com/niceblog/

    UPDATE: change the inner sub to this to get exactly what you want:

        sub {
            my $u = URI->new( $_[0] );
            return sprintf( '<a href="%s">%s%s</a>', $u, $u->host, $u->path );
        }

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Linkifying URLs in plain text
by moritz (Cardinal) on Nov 21, 2008 at 16:00 UTC
    This is the regex I use to recognize URLs in plain text:

        use Regexp::Common qw(URI);
        my $re = qr/$RE{URI}{HTTP}(?:#[\w_%:-]+)?(?<![.,])/;

    The negative look-behind ensures that in http://example.com/something, the comma isn't treated as part of the URL (it is a valid URL character, but you usually don't want it included when it trails the link).
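
The effect of the `(?<![.,])` assertion can be tried in isolation. The sketch below is only a stand-in: it swaps `$RE{URI}{HTTP}` for a crude `http://\S+` so that it runs without Regexp::Common, and the sample sentence is made up for illustration.

```perl
use strict;
use warnings;

my $text = "See http://example.com/something, or http://example.com/other.";

# \S+ first grabs the trailing "," or "."; the assertion (?<![.,])
# then forces the regex to backtrack until the match no longer ends
# in one of those characters.
my @urls = $text =~ m{(http://\S+(?<![.,]))}g;

print "$_\n" for @urls;
```

A URL that legitimately ends in a comma or period would be cut short by this pattern, which is the trade-off described above.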

    For adding the trailing / and extracting the host name you need a bit more logic, which I'm too lazy to write right now. I think Regexp::Common has an option to capture the domain name somehow, but I haven't looked into that either.

    I hope this is of interest nonetheless.

      For adding the trailing / and extracting the host name you need a bit more logic

      URI's ->canonical and ->host will do that, respectively.
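
A short sketch of those two methods, assuming the URI module from CPAN is installed. The variable name `$u` and the sample URL are illustrative only.

```perl
use strict;
use warnings;
use URI;

my $u = URI->new('http://www.example.com');

print $u->host, "\n";       # host name only, usable as the link text
print $u->canonical, "\n";  # normalized form; for http an empty path becomes "/"
```

Here `->canonical` would supply the href with the omitted slash and `->host` the visible link text.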

Re: Linkifying URLs in plain text
by Anonymous Monk on Nov 21, 2008 at 17:39 UTC
Re: Linkifying URLs in plain text
by eye (Chaplain) on Nov 22, 2008 at 08:23 UTC
    Bonus points for converting the first link (http://www.example.com) to http://www.example.com/ in the HREF (add the omitted slash).
    Be cautious of mechanically adding a trailing slash to a "directory" name. While doing this is usually desirable, there are some exceptions. As an example, Google's linux search page does not like this:
     
        http://www.google.com/linux -> linux specific search page
        http://www.google.com/linux/ -> 404 error

    I'm not aware of any site where it is a problem to add a trailing slash to a domain name, but it's only a matter of time before someone finds a new way to be too clever for their own good.
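
One conservative way to follow this advice is to add the slash only when the URL has no path at all. The sketch below uses a hypothetical helper, `add_root_slash`, that is not part of any module; it assumes the input is a single `http` or `https` URL with no surrounding text.

```perl
use strict;
use warnings;

# Hypothetical helper: append "/" only when the URL is a bare
# scheme-plus-host with no path, query, or fragment.
sub add_root_slash {
    my ($url) = @_;
    $url =~ s{^(https?://[^/?#\s]+)\z}{$1/};
    return $url;
}

print add_root_slash($_), "\n"
    for 'http://www.example.com', 'http://www.google.com/linux';
```

The first URL gains a trailing slash, while the second already has a path and is returned unchanged.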

Node Type: perlquestion [id://725157]
Approved by moritz