Linkifying URLs in plain text

grinder has asked for the wisdom of the Perl Monks concerning the following question:

It's Friday afternoon and my brain is fried.

I have a boatload of plain text that contains things that look like http links. If I find one, then I want to wrap it in HTML anchors.

I don't think there's a module that does this, but I welcome suggestions. Otherwise, the spec is in the test below. If someone could show me the error of my ways I'd be a happy man.

Bonus points for converting the first link (http://www.example.com) to http://www.example.com/ in the HREF (add the omitted slash).

use strict;
use warnings;

my $html = <<END_HTML;
Blah blah blah
Web: http://www.example.com
info.example.com
Web: http://info.example.com/
And another
Web site: http://another.example.com/ (doesn't always work)
Nice blog here: http://blog.example.com/niceblog/
END_HTML

my $target = <<END_HTML;
Blah blah blah
Web: <a href="http://www.example.com">www.example.com</a>
info.example.com
Web: <a href="http://info.example.com/">info.example.com</a>
And another
Web site: <a href="http://another.example.com/">another.example.com</a> (doesn't always work)
Nice blog here: <a href="http://blog.example.com/niceblog/">blog.example.com</a>
END_HTML

use Test::More tests => 1;

$html =~ s{(http://(\S+)(?:/\S*)?)}{<a href="$1">$2</a>}g;
is($html, $target);

• another intruder with the mooring in the heart of the Perl


Replies are listed 'Best First'.
Re: Linkifying URLs in plain text by jeffa (Bishop) on Nov 21, 2008 at 16:28 UTC
This seems to work. :)

    use strict;
    use warnings;
    use Data::Dumper;
    use URI::Find;

    my $data = do { local $/; <DATA> };

    my $finder = URI::Find->new(
        sub {
            my ($uri, $orig_uri) = @_;
            return qq|<a href="$uri">$orig_uri</a>|;
        }
    );
    $finder->find( \$data );
    print $data;

    __DATA__
    Blah blah blah
    Web: http://www.example.com
    info.example.com
    Web: http://info.example.com/
    And another
    Web site: http://another.example.com/ (doesn't always work)
    Nice blog here: http://blog.example.com/niceblog/

UPDATE: change the inner sub to this to get exactly what you want:

    sub {
        my $u = URI->new( $_[0] );
        return sprintf( '<a href="%s">%s%s</a>', $u, $u->host, $u->path );
    }

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: Linkifying URLs in plain text by moritz (Cardinal) on Nov 21, 2008 at 16:00 UTC
This is the regex I use to recognize URLs in plain text:

    use Regexp::Common qw(URI);
    my $re = qr/$RE{URI}{HTTP}(?:#[\w_%:-]+)?(?<![.,])/;

The negative look-behind ensures that in `http://example.com/something,` the comma isn't treated as part of the URL (it is a valid part of a URL, but usually you don't want to include it anyway).

For adding the trailing / and extracting the host name you need a bit more logic, which I'm too lazy to write right now. I think Regexp::Common has an option to capture the domain name somehow, but I haven't investigated that either.

I hope this is of interest nonetheless.
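To see the trailing-punctuation trick in isolation, here is a minimal sketch. It assumes a deliberately simplified pattern (`https?://\S+`) in place of `$RE{URI}{HTTP}` so that it runs without Regexp::Common; the variable names and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Simplified stand-in for $RE{URI}{HTTP}: the scheme followed by any run of
# non-whitespace. The (?<![.,]) assertion rejects a match that ends in '.'
# or ',', which forces the greedy \S+ to give back that last character.
my $url_re = qr{https?://\S+(?<![.,])};

my $text  = 'See http://example.com/something, or http://example.org/a.b.';
my @found = $text =~ /($url_re)/g;

print "$_\n" for @found;
```

The dot inside `a.b` survives because the assertion only looks at the final character of the match. Other trailing punctuation, such as a closing parenthesis, is not handled by this sketch.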
Re^2: Linkifying URLs in plain text by ikegami (Patriarch) on Nov 21, 2008 at 20:05 UTC
> For adding the trailing / and extracting the host name you need a bit more logic

URI's `->canonical` and `->host` will do that, respectively.
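A short sketch of those two methods, assuming the URI module from CPAN is installed. The helper name `anchor_for` is hypothetical and not from the thread.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: build an anchor whose href is the canonical form of
# the URL and whose link text is just the host name.
sub anchor_for {
    my ($raw) = @_;
    my $uri = URI->new($raw);
    # For an http URL with an empty path, canonical supplies the "/".
    return sprintf '<a href="%s">%s</a>', $uri->canonical, $uri->host;
}

print anchor_for('http://www.example.com'), "\n";
print anchor_for('http://blog.example.com/niceblog/'), "\n";
```

Note that `canonical` only adds a slash when the path is empty; a URL that already has a path is not given an extra trailing slash.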
Re: Linkifying URLs in plain text by Anonymous Monk on Nov 21, 2008 at 17:39 UTC
See also: "Change URIs in Text to HTML-Links" and "How do I replace a URL with a clickable hyperlink?"
Re: Linkifying URLs in plain text by eye (Chaplain) on Nov 22, 2008 at 08:23 UTC
> Bonus points for converting the first link (http://www.example.com) to http://www.example.com/ in the HREF (add the omitted slash).

Be cautious of mechanically adding a trailing slash to a "directory" name. While doing this is usually desirable, there are some exceptions. As an example, Google's Linux search page does not like this:

    http://www.google.com/linux  -> Linux-specific search page
    http://www.google.com/linux/ -> 404 error

I'm not aware of any site where it is a problem to add a trailing slash to a domain name, but it's only a matter of time before someone finds a new way to be too clever for their own good.
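One way to follow that advice with a plain regex is sketched below, using core Perl only. The helper name `linkify` is hypothetical. The host is captured with `[^/\s]+` rather than the `\S+` of the original post, since a greedy `\S+` also swallows the slash and everything after it. The slash is appended only when the URL has no path at all.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each http:// URL in an anchor, using the host as
# the link text. A "/" is added to the href only for a bare host; an
# existing path such as /linux is passed through untouched.
sub linkify {
    my ($text) = @_;
    $text =~ s{http://([^/\s]+)(/\S*)?}{
        my ($host, $path) = ($1, $2);
        $path = '/' unless defined $path;    # bare host: supply the slash
        sprintf '<a href="http://%s%s">%s</a>', $host, $path, $host;
    }ge;
    return $text;
}

print linkify("Web: http://www.example.com\n");
print linkify("Search: http://www.google.com/linux\n");
```

This sketch does not strip trailing punctuation or handle `https`, ports, or query strings on a bare host; a module such as URI::Find is the safer choice for real text.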

