REGEX for url

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: REGEX for url by graff (Chancellor) on Apr 25, 2016 at 21:44 UTC
It looks like you're just trying to extract values of `href=` attributes from anchor tags (i.e. the "..." from `<a href="...">`) in HTML data. I'm surprised that no one has yet mentioned that there are CPAN modules for doing exactly that, e.g. HTML::LinkExtor, among others. (I haven't had occasion to use them myself, but to do what you're doing, I'd start with one of those.)
Re^2: REGEX for url by wrkrbeee (Scribe) on Apr 25, 2016 at 21:46 UTC
You are exactly right, extract data between anchor tags. I will try the CPAN module you mentioned. Thank you!!
Re^3: REGEX for url by graff (Chancellor) on Apr 25, 2016 at 21:52 UTC
Having looked a little more at the CPAN search results, I find it odd that the man page for HTML::LinkExtor appears to be shorter and simpler than the one for HTML::SimpleLinkExtor -- I'm not sure what "Simple" is supposed to refer to in the latter module.
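A minimal sketch of the HTML::LinkExtor approach suggested in this subthread, for readers who want a starting point. It assumes the module is installed from CPAN and feeds it one row of the sample data as a string (the `@hrefs` array is just an illustrative name); with no callback passed to `new`, the parser collects links internally and `links` returns them afterwards.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # CPAN module, not part of core Perl

# One row of the sample data from this thread, as a plain string.
my $html = '<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>';

my $parser = HTML::LinkExtor->new;   # no callback given, so links accumulate internally
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ($parser->links) {
    # Each entry is an array ref: [ $tag, $attr_name => $url, ... ]
    my ($tag, %attr) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @hrefs;
```

For a file on disk, `$parser->parse_file($path)` can replace the `parse`/`eof` pair.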
Re: REGEX for url by tangent (Parson) on Apr 25, 2016 at 22:15 UTC
Others have suggested HTML::LinkExtor. Here is a way to do it using HTML::TreeBuilder::XPath, which is very handy if you need to extract other information from the file.
`use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("/path/to/file.html");
$tree->eof;
my @links = $tree->findnodes('//a');
for my $link ( @links ) {
    print $link->attr('href'), "\n";
}`
That will print every link. If you only want the links from the table then:
`my @links = $tree->findnodes('//td/a');
for my $link ( @links ) {
    print $link->attr('href'), "\n";
}`
Output:
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
Re: REGEX for url by james28909 (Deacon) on Apr 25, 2016 at 20:42 UTC
`my $line = '<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>';
$line =~ s/.*a href="(.*)".*/$1/;
print $line;`
Re^2: REGEX for url by wrkrbeee (Scribe) on Apr 25, 2016 at 20:52 UTC
Thank you for your help! That expression does not seem to bind to anything for me; perhaps there is something else that I'm doing wrong? Below is a small amount of the code. Thanks again!
`$/="</html>";
while (my $line = <$FH_IN>) {
    chomp $line;   # removes line break or new line
    my $url_sub = "";
    my $data = "";
    $url_sub =~ s/.*a href="(.*)".*/$1/;
    print $url_sub;`
Re^3: REGEX for url by james28909 (Deacon) on Apr 25, 2016 at 20:57 UTC
This works for me:
`use strict;
use warnings;
for (<DATA>) {
    print if s/.*a href="(.*)".*/$1/;
}
__DATA__
<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>`
Output:
`C:\Users\James\Desktop\perlmonks>perlmonks.pl
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt`
EDIT: It seems that `$/ = "</html>";` changes the input record separator in such a way that it completely breaks the functionality of the simple regex. Do you have any links to documentation on this `$/ = "</html>";`?
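For reference, `$/` is Perl's input record separator, documented in perlvar. The following is a minimal sketch of what setting it to `"</html>"` does, using a small made-up page read through an in-memory filehandle (the page contents and the `@hrefs` name are assumptions for illustration, not the original poster's data). Each read returns everything up to and including `</html>`, so one "line" holds the whole page; a `/g` match in list context then collects every href from that single record.

```perl
use strict;
use warnings;

# Hypothetical two-link page, used only for illustration.
my $html = <<'HTML';
<html>
<td><a href="/docs/0001.txt">0001.txt</a></td>
<td><a href="/docs/0002.txt">0002.txt</a></td>
</html>
HTML

# Read the string through an in-memory filehandle, as if it were a file.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";

my @hrefs;
{
    local $/ = "</html>";      # a record now ends at "</html>", not at "\n"
    while (my $record = <$fh>) {
        chomp $record;         # chomp strips the current $/ value, i.e. "</html>"
        # [^"]* stops at the closing quote; /g in list context returns every capture.
        push @hrefs, $record =~ /<a\s+href="([^"]*)"/gi;
    }
}
close $fh;

print "$_\n" for @hrefs;
```

Using `local` keeps the changed separator confined to the enclosing block, so later reads elsewhere in the program still split on newlines.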
Re^4: REGEX for url by wrkrbeee (Scribe) on Apr 25, 2016 at 21:28 UTC
Re^5: REGEX for url by Marshall (Canon) on Apr 25, 2016 at 22:24 UTC
Re^4: REGEX for url by wrkrbeee (Scribe) on Apr 25, 2016 at 21:09 UTC
Re^5: REGEX for url by NetWallah (Canon) on Apr 25, 2016 at 21:19 UTC
Re^5: REGEX for url by ExReg (Priest) on Apr 25, 2016 at 22:07 UTC
Re: REGEX for url by ww (Archbishop) on Apr 26, 2016 at 20:36 UTC
I downvoted the OP (belatedly). Here's why: "Among other expressions, I have tried: m/subsid(.)(">)/" ... and not even in code tags, at that. Missing from your regex: modifiers to make it case-insensitive and multi-line... and context (even if simplified) to make it easy for us to spot non-regex errors. The code in your narrative doesn't even come close to doing what you say* you want.
It's time for you to do some reading -- in this case, perlretut and friends -- and stop typing in poorly constructed questions every time you face an issue. Also, you've posted too much data: if you've stated your intention precisely, then there's no need for the entire html for Row 9 of the table. This is a very poor post, even given the low quality of your recent nodes.
So here's a crummy example (see much better suggestions above re modules) constructed solely to demonstrate that if you're going down the (fool's) path of trying to parse html with a regex, it can be done. It's so bad an example that I feel free to offer it to a gimmé-artist:
`#!/usr/bin/perl
use strict;
use warnings;
my @lines = <DATA>;
for my $line (@lines) {
    print "| $line |";
    if ($line =~ /(<a href.+<\/a>)/) {   # note, no need to capture the whole of row 9
        print "$1 \n\n";
    } else {
        print "Crummy regex\n"
    }
}
__DATA__
<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>`
And here's execution:
`C:>wrkrbeejunk.pl
| <td scope="row">9</td>
 |Crummy regex
| <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
 |Crummy regex
| <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
 |<a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a> 

| <td scope="row">EX-21.1</td>
 |Crummy regex
C:\>`
*Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by: code; verbatim error and/or warning messages; and a coherent explanation of what "doesn't work" actually means.

