
Regexp to extract HTML link data

by hatter (Pilgrim)
on Jul 17, 2003 at 12:30 UTC (#275196=perlquestion)

hatter has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to work out a regexp which, when given either

    $in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';

or

    $in = '<td><a href="index3.html">New index</a></td>';

will give me the link data, regardless, and the image data, should there be one. I've tried various combinations after the initial

    my ($new,$hit) = ($in =~ m#(foo.jpg)?.*(<a href=.*</a>)#m);

It looks simple enough, but has stumped a couple of my friends, too. I'm trying to do it in a single regexp - although the actual problem could check for the bits separately, it's got me stumped enough to want an answer, out of curiosity (and doing it in two bits makes the rest of the code more complicated). FWIW, the link data varies, the image data is static.

the hatter
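
For anyone following along, here is a minimal sketch of why the first attempt never returns the image. The group `(foo.jpg)?` is optional, so at position 0, where there is no `foo.jpg`, it matches nothing, and the greedy `.*` then finds the link on its own. The sub names `original_attempt` and `extract`, and the reworked pattern in `extract`, are illustrative assumptions rather than anything from the question.

```perl
use strict;
use warnings;

my @rows = (
    '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
    '<td><a href="index3.html">New index</a></td>',
);

# The pattern from the question: the optional group is tried at position 0,
# matches the empty string there, and so its capture stays undef.
sub original_attempt {
    my ($in) = @_;
    return $in =~ m#(foo.jpg)?.*(<a href=.*</a>)#m;
}

# Hypothetical variation: move the "skip ahead" part inside the optional
# group, so the engine searches for the image before giving up on it.
sub extract {
    my ($in) = @_;
    return $in =~ m#(?:.*?(foo\.jpg))?.*?(<a href=.*?</a>)#s;
}

for my $row (@rows) {
    my ($img, $link) = extract($row);
    printf "img=%s link=%s\n", defined $img ? $img : '(none)', $link;
}
```

The variation still assumes the image, when present, comes before the link, as in both sample inputs.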

Title edit by tye

Replies are listed 'Best First'.
Re: Regexp riddles
by broquaint (Abbot) on Jul 17, 2003 at 12:42 UTC
    Under the blind assumption that your data won't change too much or become 'faulty' (otherwise you'd be using a parser, right?), something like this ought to do:
        my $re = qr{
          (?: <img \s+ .*? src=" ([^"]+) " .*? > )?
          <a \s+ .*? href=" ([^"]+) " .*? >
        }x;

        $in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';
        my($href, $img) = grep defined, reverse $in =~ $re;
        print "href - $href\nimg - $img\n";

        $in = '<td><a href="index3.html">New index</a></td>';
        ($href, $img) = grep defined, reverse $in =~ $re;
        print "href - $href\nimg - $img\n";

        __output__

        href - index3.html
        img - foo.jpg
        href - index3.html
        img -
    See perlre for more info.
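
The `grep defined, reverse ...` line is the subtle part. In list context the match returns its captures in pattern order (image, then href), with undef in the slot of an optional group that did not take part. `reverse` puts the href first and `grep defined` drops the undef. The same extraction written out step by step might look like the sketch below; the sub name `link_and_image` is made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as in the reply: optional <img>, then the <a> tag.
my $re = qr{
    (?: <img \s+ .*? src=" ([^"]+) " .*? > )?   # capture 1: image src, may be undef
    <a \s+ .*? href=" ([^"]+) " .*? >           # capture 2: link href
}x;

# Hypothetical helper: returns (href, img); img is undef when there is no image.
sub link_and_image {
    my ($in) = @_;
    my ($img, $href) = $in =~ $re    # captures arrive in pattern order
        or return;                   # empty list when nothing matched
    return ($href, $img);
}

for my $in ('<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
            '<td><a href="index3.html">New index</a></td>') {
    my ($href, $img) = link_and_image($in);
    print "href - $href\nimg - ", (defined $img ? $img : '(none)'), "\n";
}
```

Testing `defined $img` explicitly also avoids the "uninitialized value" warning that interpolating an undef `$img` would raise under `use warnings`.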


      Thanks, that looks like the ticket. And your assumptions are correct - HTML parsers, um, no thank you. The input happens to be HTML, but it's very simple, fairly fixed format, and the problem could just as easily be expressed without HTML tags. And I'm hoping to wrap it all up in a map() (lots of data to iterate over) so it's much neater.

      Now, off to spend more time staring hard at the solution until its lessons burn themselves deep into my brain.


      the hatter
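
As for wrapping it all up in a map(), a rough sketch of one way that could go is below. It reuses the pattern from the reply above; the `@rows` data, the `@links` name, and the hash keys are assumptions for illustration only.

```perl
use strict;
use warnings;

my $re = qr{
    (?: <img \s+ .*? src=" ([^"]+) " .*? > )?
    <a \s+ .*? href=" ([^"]+) " .*? >
}x;

# Hypothetical input; in real use this would be the data being iterated over.
my @rows = (
    '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
    '<td><a href="index3.html">New index</a></td>',
    '<td>no link here</td>',
);

# One hash per matching row; rows that do not match contribute nothing.
my @links = map {
    /$re/ ? +{ href => $2, img => $1 } : ()
} @rows;

for my $l (@links) {
    print "href - $l->{href}, img - ",
          (defined $l->{img} ? $l->{img} : '(none)'), "\n";
}
```

Returning an empty list for non-matching rows keeps the result free of placeholder entries, at the cost of losing the one-to-one correspondence with the input.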

Re: Regexp riddles
by dragonchild (Archbishop) on Jul 17, 2003 at 12:36 UTC
    Don't parse HTML with a regex. Use HTML::Parser - that's why it exists.

    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Regexp riddles
by Abigail-II (Bishop) on Jul 17, 2003 at 12:38 UTC
    Your problem isn't well defined. How much variation can there be? If you have to support all HTML possibilities, you'd be much better off using a parser.

    Anyway, here's an untested attempt. Most likely, it breaks on your second example:

    my (undef, $new, $hit) = $in =~ m{
        <td><img \s+ src \s* = \s* (["']) (foo[.]jpg) \1 >
        (<a \s+ href \s* = (["']) [^"']* \4 > [^<]* </a>)
    }ix;
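
One way to get the second example to match as well is to wrap the image part in a non-capturing `(?: ... )?` group, so that it becomes optional. The sketch below builds on the pattern above and is only lightly checked against the two sample inputs; the sub name `img_and_link` is an assumption.

```perl
use strict;
use warnings;

my $re = qr{
    <td>
    (?: <img \s+ src \s* = \s* (["']) (foo[.]jpg) \1 \s* > )?    # optional image
    ( <a \s+ href \s* = \s* (["']) [^"']* \4 \s* > [^<]* </a> )  # the whole link
}ix;

# Hypothetical helper: returns (img, link); img is undef when there is no image.
sub img_and_link {
    my ($in) = @_;
    # Groups 1 and 4 only remember the quote character, so skip group 1
    # and let the trailing group 4 fall off the end of the list.
    my (undef, $img, $link) = $in =~ $re;
    return ($img, $link);
}

for my $in ('<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
            '<td><a href="index3.html">New index</a></td>') {
    my ($img, $link) = img_and_link($in);
    print "img:  ", (defined $img ? $img : '(none)'), "\nlink: $link\n";
}
```

Because the optional group sits directly after `<td>`, the pattern still expects the fairly fixed layout of the sample data.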


Re: Regexp riddles
by Aristotle (Chancellor) on Jul 17, 2003 at 13:08 UTC
    Try HTML::LinkExtractor - should be less trouble than figuring out a regex that works.

    Makeshifts last the longest.

Re: Regexp riddles
by demerphq (Chancellor) on Jul 17, 2003 at 14:37 UTC

    I dunno, this didn't seem to be too difficult, so I have a feeling I've gone wrong here somewhere. But here's my go.

        use strict;
        use warnings;

        foreach ('<td><img src="foo.jpg">'.
                 '<a href="index3.html">New index</a></td>',
                 '<td><a href="index3.html">New index</a></td>')
        {
            if (/<td>(?:<img[ ]src="([^"]+)">)?
                 <a[ ]href="([^"]+)">((?:(?!<\/a>).)*)
                 <\/a>/six)
            {
                print "Matched!\tImg=", ($1 ? $1 : 'None'),
                      "\tLink: $2\t Link Text: $3\n";
            }
        }

    Sorry about the weird look of the code; it's mostly like that to fit average settings on the site.
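
The `(?:(?!<\/a>).)*` piece is worth a note. It consumes one character at a time, but only while `</a>` does not start at the current position, so the capture cannot run past the closing tag. A small sketch contrasting it with a greedy `.*` follows; the sample string with two links is made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical input with two links on one line.
my $in = '<a href="a.html">First</a> and <a href="b.html">Second</a>';

# Greedy: .* runs to the last </a>, swallowing everything in between.
my ($greedy)   = $in =~ m{<a[ ]href="[^"]+">(.*)</a>}s;

# Tempered: each character is accepted only if </a> does not start there.
my ($tempered) = $in =~ m{<a[ ]href="[^"]+">((?:(?!</a>).)*)</a>}s;

print "greedy:   $greedy\n";
print "tempered: $tempered\n";
```

Separately, `$1 ? $1 : 'None'` tests truth rather than definedness, so an image literally named `0` would print as None; `defined $1` is the stricter check.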


    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

Node Type: perlquestion [id://275196]
Approved by Corion