REGEX for url

by wrkrbeee (Scribe)
on Apr 25, 2016 at 20:32 UTC (#1161477=perlquestion)

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I have a fairly straightforward string (in the code section below) which contains, amidst other characters, a URL. The goal is to extract the URL and assign it to a variable. Try as I might, I keep coming up short. Among other expressions, I have tried: m/subsid(.*)(">)/ Grateful for any ideas. Thank you!

<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>

Replies are listed 'Best First'.
Re: REGEX for url
by graff (Chancellor) on Apr 25, 2016 at 21:44 UTC
    It looks like you're just trying to extract values of href= attributes from anchor tags (i.e. the "..." from <a href="...">) in html data.

    I'm surprised that no one yet has mentioned that there are CPAN modules for doing exactly that, e.g. HTML::LinkExtor, among others. (I haven't had occasion to use them myself, but to do what you're doing, I'd start with one of those.)

      You are exactly right, extract data between anchor tags. I will try the CPAN module you mentioned. Thank you!!
        Having looked a little more at the CPAN search results, I find it odd that the man page for HTML::LinkExtor appears to be shorter and simpler than the one for HTML::SimpleLinkExtor -- I'm not sure what "Simple" is supposed to refer to in the latter module.
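For readers who want to try the module suggested above, here is a minimal sketch, assuming HTML::LinkExtor (part of the HTML-Parser distribution on CPAN) is installed. The variable names are illustrative only and the sample input is the row from the question. The callback receives the tag name followed by attribute/value pairs, so it can keep just the `href` values of `<a>` tags.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # CPAN module, not part of core Perl

# Sample input: the table row from the original question.
my $html = <<'END_HTML';
<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>
END_HTML

my @hrefs;

# The callback is invoked once per link-bearing tag with ($tag, attr => value, ...).
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
});

$parser->parse($html);   # feed the markup to the parser
$parser->eof;            # signal end of input

print "$_\n" for @hrefs;
```

No base URL is passed to `new` here, so the values come back as the plain strings found in the markup rather than as absolute URLs.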
Re: REGEX for url
by tangent (Vicar) on Apr 25, 2016 at 22:15 UTC
    Others have suggested HTML::LinkExtor. Here is a way to do it using HTML::TreeBuilder::XPath. Very handy if you need to extract other information from the file.
    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file("/path/to/file.html");
    $tree->eof;
    my @links = $tree->findnodes('//a');
    for my $link ( @links ) {
        print $link->attr('href'), "\n";
    }
    That will print every link. If you only want the links from the table then:
    my @links = $tree->findnodes('//td/a');
    for my $link ( @links ) {
        print $link->attr('href'), "\n";
    }

    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
Re: REGEX for url
by james28909 (Deacon) on Apr 25, 2016 at 20:42 UTC
    my $line = '<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>';
    $line =~ s/.*a href="(.*)".*/$1/;
    print $line;

      Thank you for your help! That expression does not seem to bind to anything for me, something else perhaps that I'm doing wrong? Below is a small amount of the code. Thanks again!

      $/="</html>";
      while (my $line = <$FH_IN>) {
          chomp $line; #removes line break or new line;
          my $url_sub = "";
          my $data="";
          $url_sub =~ s/.*a href="(.*)".*/$1/;
          print $url_sub;
        This works for me:
        use strict;
        use warnings;
        for(<DATA>){
            print if s/.*a href="(.*)".*/$1/;
        }
        __DATA__
        <td scope="row">9</td>
        <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
        <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
        <td scope="row">EX-21.1</td>


        C:\Users\James\Desktop\perlmonks>
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt

        EDIT: It seems that $/ = "</html>"; changes the input record separator in such a way that it completely breaks the functionality of the simple regex. Do you have any links to documentation on this $/ = "</html>"; ?
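On the documentation question: `$/` is described in perlvar as the input record separator. With `$/ = "</html>"`, each read returns a whole document rather than one line, and `chomp` strips the trailing `</html>` instead of a newline. Because `.` does not match a newline without `/s`, the substitution `s/.*a href="(.*)".*/$1/` then rewrites only the first line that contains a link and leaves the rest of the record in place. The loop posted above also applies the substitution to the empty `$url_sub` rather than to `$line`. The following is a sketch of one way to collect the links in record mode, not the poster's actual program: the helper name `hrefs_in` and the sample input are hypothetical, and an in-memory filehandle stands in for `$FH_IN`.

```perl
use strict;
use warnings;

# Hypothetical helper: every href value found in one record.
sub hrefs_in {
    my ($record) = @_;
    # In list context a /g match returns all captures;
    # [^"]* cannot run past the closing quote of the attribute.
    return $record =~ /<a\s+href="([^"]*)"/gi;
}

# Stand-in for the real input file: two tiny documents in one string.
my $input = <<'END_HTML';
<html><td><a href="/a/0001.txt">0001.txt</a></td>
<td><a href="/a/0002.txt">0002.txt</a></td></html>
<html><td><a href="/b/0001.txt">0001.txt</a></td></html>
END_HTML

open my $fh, '<', \$input or die "cannot open in-memory file: $!";

my @found;
{
    local $/ = "</html>";      # one record per document, not per line
    while (my $record = <$fh>) {
        chomp $record;         # removes the trailing "</html>", not a newline
        push @found, hrefs_in($record);
    }
}
close $fh;

print "$_\n" for @found;
```

Using `local` keeps the changed separator confined to the block, so later reads elsewhere in the program still work line by line.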

Re: REGEX for url
by ww (Archbishop) on Apr 26, 2016 at 20:36 UTC

    I downvoted the OP (belatedly). Here's why:

    "Among other expressions, I have tried: m/subsid(.*)(">)/" ... and not even in code tags, at that.

    Missing from your regex: modifiers to make it case-insensitive and multi-line... and context (even if simplified) to make it easy for us to spot non-regex errors.
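A minimal sketch of what those modifiers change, assuming the four table cells arrive as one multi-line string. The sub name `subsidiary_href` is hypothetical and not from the thread. `/i` lets `subsid` match the upper-case `SUBSIDIARIES`, `/s` lets `.` cross the newlines between the cells, and the non-greedy `.*?` together with `[^"]*` stops at the first link instead of running to the last quote.

```perl
use strict;
use warnings;

# Hypothetical helper: href of the first link that follows the text
# "subsid" (in any letter case), or undef when there is none.
sub subsidiary_href {
    my ($html) = @_;
    # /i: case-insensitive match, /s: "." also matches newlines
    return $html =~ /subsid.*?<a\s+href="([^"]*)"/is ? $1 : undef;
}

# Sample input: the table row from the original question.
my $row = <<'END_HTML';
<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>
END_HTML

my $url = subsidiary_href($row);
print defined $url ? "$url\n" : "no match\n";
```

This only works when the whole row is in one string; read line by line, the label and the link sit on different lines and no single line matches.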

    The code in your narrative doesn't even come close to doing what you say you want. It's time for you to do some reading -- in this case, perlretut and friends -- and stop typing in poorly constructed questions every time you face an issue.

    Also, you've posted too much data: if you've stated your intention precisely, then there's no need for the entire html for Row 9 of the table. This is a very poor post, even given the low quality of your recent nodes.

    So here's a crummy example (see much better suggestions above re modules) constructed solely to demonstrate that if you're going down the (fool's) path of trying to parse html with a regex, it can be done. It's so bad an example that I feel free to offer it to a gimmé-artist:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @lines = <DATA>;
    for my $line(@lines) {
        print "| $line |";
        if ($line =~ /(<a href.+<\/a>)/) {   # note, no need to capture the whole of row 9
            print "$1 \n\n";
        } else {
            print "Crummy regex\n"
        }
    }
    __DATA__
    <td scope="row">9</td>
    <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
    <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
    <td scope="row">EX-21.1</td>

    And here's execution:

    C:>
    | <td scope="row">9</td>
    |Crummy regex
    | <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
    |Crummy regex
    | <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
    |<a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a>

    | <td scope="row">EX-21.1</td>
    |Crummy regex
    C:\>

    Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
    1. code
    2. verbatim error and/or warning messages
    3. a coherent explanation of what "doesn't work" actually means.

Node Type: perlquestion [id://1161477]
Approved by graff