I ran this code, essentially as suggested by james28909, against your data set. The approach has obvious flaws where HTML structure is concerned: it prints every href on an anchor, including ones that you don't care about. A module that actually parses the HTML would be better.
#!/usr/bin/perl
use warnings;
use strict;
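# Read the sample page from the __DATA__ section, one line at a time.
while (my $line = <DATA>)
{
    # Note: (.*) is greedy, so it captures up to the last double quote on the line.
    my ($url) = $line =~ m/.*a href="(.*)".*/;
    next unless $url;
    print "$url\n";
}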
=Prints
javascript:history.back()
http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
/edgar/searchedgar/webusers.htm
http://www.sec.gov/
/edgar/searchedgar/webusers.htm
/edgar/searchedgar/companysearch.html
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
/cgi-bin/browse-edgar?CIK=0001050122&action=getcompany
/cgi-bin/browse-edgar?action=getcompany&SIC=3827&owner=include
Process completed successfully
=cut
__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
.... abbreviated to reduce space.....
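
For the module-based approach mentioned at the top, here is a rough sketch of one way it might look. It assumes HTML::LinkExtor (part of the HTML-Parser distribution on CPAN) is installed. The markup in $html is a hypothetical sample put together to resemble a few lines of the page, not the real data set, and @urls is just an illustrative name.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::LinkExtor;    # CPAN module, ships with the HTML-Parser distribution

# Hypothetical sample markup, modelled on the kind of lines in the page.
my $html = <<'HTML';
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
<a href="http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a>
<a href="/edgar/searchedgar/webusers.htm">EDGAR</a>
HTML

my @urls;

# The callback is invoked once per link-bearing tag with the tag name
# followed by attribute/value pairs for that tag's link attributes.
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        # Keep only anchors; <link>, <img> and other tags are skipped.
        return unless $tag eq 'a' && defined $attr{href};
        push @urls, $attr{href};
    }
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @urls;
```

Because the parser works on tags rather than on raw lines, the href value stops at its own closing quote, and the tag filter is the place to add further conditions, for example keeping only URLs that start with /Archives/ if those are the ones of interest.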