Regexp to extract HTML link data

hatter has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to work out a regexp which when given either:

$in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';
or
$in = '<td><a href="index3.html">New index</a></td>';

will give me the link data in every case, and the image data if there is any. I've tried various combinations after the initial

my ($new,$hit) = ($in =~ m#(foo.jpg)?.*(<a href=.*</a>)#m);

It looks simple enough, but it has stumped a couple of my friends, too. I'm trying to do it in a single regexp. The actual problem could check for the bits separately, but it's got me stumped enough to want an answer out of curiosity (and doing it in two bits makes the rest of the code more complicated). FWIW, the link data varies; the image data is static.

the hatter
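
A note on why the attempt in the question never captures the image: the match is tried from the leftmost position first, and at offset 0 the optional group `(foo.jpg)?` is allowed to match nothing, after which `.*` swallows everything up to the link. The sketch below shows that behaviour next to one possible fix. The `<td>` anchor and the variable names `$img` and `$link` are assumptions based on the sample data, not something stated in the question.

```perl
use strict;
use warnings;

my $in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';

# Original attempt: at offset 0 the optional group cannot match "foo.jpg",
# so it matches nothing and the following .* consumes the image tag.
my ($new, $hit) = $in =~ m#(foo.jpg)?.*(<a href=.*</a>)#m;
print "unanchored: img=", defined $new ? $new : 'undef', " link=$hit\n";

# Anchored variant (assumes every row starts with <td>): the optional
# group now sits exactly where the image tag would be.
my ($img, $link) = $in =~ m#<td>(?:<img src="(foo\.jpg)">)?(<a href=.*</a>)#;
print "anchored:   img=", defined $img ? $img : 'undef', " link=$link\n";
```

The design point is simply that an optional group only helps if the surrounding pattern forces the engine to try it at the right place.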

Title edit by tye


Replies are listed 'Best First'.
Re: Regexp riddles by broquaint (Abbot) on Jul 17, 2003 at 12:42 UTC
Under the blind assumption that your data won't change too much or become 'faulty' (otherwise you'd be using a parser, right?), something like this ought to do:

my $re = qr{
    (?: <img \s+ .? src=" ([^"]+) " .? > )?
    <a \s+ .? href=" ([^"]+) " .? >
}x;

$in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';
my($href, $img) = grep defined, reverse $in =~ $re;
print "href - $href\nimg - $img\n";

$in = '<td><a href="index3.html">New index</a></td>';
($href, $img) = grep defined, reverse $in =~ $re;
print "href - $href\nimg - $img\n";

__output__
href - index3.html
img - foo.jpg
href - index3.html
img -

See perlre for more info.

HTH

_________
broquaint
Re: Re: Regexp riddles by hatter (Pilgrim) on Jul 17, 2003 at 14:06 UTC
Thanks, that looks like the ticket. And your assumptions are correct - HTML parsers, um, no thank you. The input happens to be HTML, but it's very simple, fairly fixed format, and the problem could just as easily be expressed without HTML tags. And I'm hoping to wrap it all up in a map() (lots of data to iterate over) so it's much neater. Now, off to spend more time staring hard at the solution until its lessons burn themselves deep into my brain.

thanks

the hatter
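
For the map() idea mentioned in this reply, a rough sketch might look like the following. It uses a simplified form of the regex suggested earlier in the thread; the `@rows` data, the `@links` name, and the hash layout are made up for illustration and are not from the original posts.

```perl
use strict;
use warnings;

# Simplified form of the regex suggested above: optional image, then link.
my $re = qr{ (?: <img \s+ src=" ([^"]+) " > )? <a \s+ href=" ([^"]+) " > }x;

# Hypothetical input rows.
my @rows = (
    '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
    '<td><a href="index4.html">Old index</a></td>',
    '<td>no link here</td>',
);

# One hash per matching row; a row without a link yields an empty list
# and so drops out of the result.
my @links = map {
    my ($img, $href) = /$re/;
    defined $href ? +{ href => $href, img => $img } : ();
} @rows;

for my $l (@links) {
    printf "href - %s\timg - %s\n",
        $l->{href}, defined $l->{img} ? $l->{img} : '(none)';
}
```

Keeping the captures in positional order and testing `defined` avoids the `grep defined, reverse` step, at the cost of carrying an undef `img` for rows without an image.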
Re: Regexp riddles by dragonchild (Archbishop) on Jul 17, 2003 at 12:36 UTC
Don't parse HTML with a regex. Use HTML::Parser - that's why it exists.

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Regexp riddles by Abigail-II (Bishop) on Jul 17, 2003 at 12:38 UTC
Your problem isn't well defined. How much variation can there be? If you have to support all HTML possibilities, you'd be much better off using a parser. Anyway, here's an untested attempt. Most likely, it breaks on your second example:

my (undef, $new, $hit) = $in =~ m{
    <td><img \s+ src \s* = \s* (["']) foo[.]jpg \1 >
    (<a \s+ href \s* = (["']) [^"']* \4 > [^<]* </a>}ix;

Abigail
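
As posted, this pattern will not compile: the parenthesis opened before `<a` is never closed, and `\4` refers to a fourth group that does not exist. One guess at the intent, which makes `\1`, `\4` and the `(undef, $new, $hit)` assignment line up, is to capture `foo[.]jpg` and close the group after `</a>`. This is a reconstruction under that assumption, not necessarily what Abigail meant, and it still requires the image tag, so it fails on the second example just as the post warns.

```perl
use strict;
use warnings;

my $in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';

# Groups: 1 = quote around the image name, 2 = image name,
#         3 = whole <a>...</a> element,    4 = quote around the href.
# The first capture is discarded via undef; the fourth is simply dropped.
my (undef, $new, $hit) = $in =~ m{
    <td><img \s+ src \s* = \s* (["']) (foo[.]jpg) \1 >
    (<a \s+ href \s* = \s* (["']) [^"']* \4 > [^<]* </a>)
}ix;

print "img:  $new\nlink: $hit\n" if defined $hit;
```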
Re: Regexp riddles by Aristotle (Chancellor) on Jul 17, 2003 at 13:08 UTC
Try HTML::LinkExtractor - should be less trouble than figuring out a regex that works.

Makeshifts last the longest.
Re: Regexp riddles by demerphq (Chancellor) on Jul 17, 2003 at 14:37 UTC
I dunno, this didn't seem to be too difficult, so I have a feeling I've gone wrong here somewhere. But here's my go.

use strict;
use warnings;

foreach ('<td><img src="foo.jpg">'.
         '<a href="index3.html">New index</a></td>',
         '<td><a href="index3.html">New index</a></td>')
{
    if (/<td>(?:<img[ ]src="([^"]+)">)?
         <a[ ]href="([^"]+)">((?:(?!<\/a>).)*)
         <\/a>/six)
    {
        print "Matched!\tImg=", ($1 ? $1 : 'None'),
              "\tLink: $2\t Link Text: $3\n";
    }
}

Sorry about the weird look of the code; it's mostly like that to fit average settings on the site.

---
demerphq

<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
