PerlMonks  

Re^4: REGEX for url

by wrkrbeee (Scribe)
on Apr 25, 2016 at 21:28 UTC


in reply to Re^3: REGEX for url
in thread REGEX for url

Not sure if this helps, but the full text block, from <html> through </html>, appears below. I am just using $/ as a way to indicate the end of a record. I apologize for wasting your time.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
<div id="headerTop">
   <div id="Nav"><a href="http://www.sec.gov/index.htm">Home</a> | <a href="/cgi-bin/browse-edgar?action=getcurrent">Latest Filings</a> | <a href="javascript:history.back()">Previous Page</a></div>
   <div id="seal"><a href="http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a></div>
   <div id="secWordGraphic"><img src="/images/bannerTitle.gif" alt="SEC Banner" /></div>
</div>
<div id="headerBottom">
   <div id="searchHome"><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a></div>
   <div id="PageTitle">Filing Detail</div>
</div>
<!-- END BANNER -->
<!-- BEGIN BREADCRUMBS -->
<div id="breadCrumbs">
   <ul>
      <li><a href="http://www.sec.gov/">SEC Home</a> &#187;</li>
      <li><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a> &#187;</li>
      <li><a href="/edgar/searchedgar/companysearch.html">Company Search</a> &#187;</li>
      <li class="last">Current Page</li>
   </ul>
</div>
<!-- END BREADCRUMBS -->
<div id="contentDiv">
<div id="formDiv">
   <!-- START FILING DIV -->
   <div id="formHeader">
      <div id="formName">
         <strong>Form 10-K</strong> - Annual report [Section 13 and 15(d), not S-K Item 405]
      </div>
      <div id="secNum">
         <strong><acronym title="Securities and Exchange Commission">SEC</acronym> Accession <acronym title="Number">No.</acronym></strong> 0000927356-01-000365
      </div>
   </div>
   <div class="formContent">
      <div class="formGrouping">
         <div class="infoHead">Filing Date</div>
         <div class="info">2001-03-30</div>
         <div class="infoHead">Accepted</div>
         <div class="info">1995-09-28 00:00:00</div>
         <div class="infoHead">Documents</div>
         <div class="info">10</div>
      </div>
      <div class="formGrouping">
         <div class="infoHead">Period of Report</div>
         <div class="info">2000-12-30</div>
      </div>
      <div style="clear:both"></div>
   </div>
   <!-- END FILING DIV -->
   <!-- START DOCUMENT DIV -->
   <div style="padding: 0px 0px 4px 0px; font-size: 12px; margin: 0px 2px 0px 5px; width: 100%; overflow:hidden">
      <p>Document Format Files</p>
      <table class="tableFile" summary="Document Format Files">
         <tr>
            <th scope="col" style="width: 5%;"><acronym title="Sequence Number">Seq</acronym></th>
            <th scope="col" style="width: 40%;">Description</th>
            <th scope="col" style="width: 20%;">Document</th>
            <th scope="col" style="width: 10%;">Type</th>
            <th scope="col">Size</th>
         </tr>
         <tr>
            <td scope="row">1</td>
            <td scope="row">ANNUAL REPORT</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">0001.txt</a></td>
            <td scope="row">10-K</td>
            <td scope="row">194594</td>
         </tr>
         <tr class="blueRow">
            <td scope="row">2</td>
            <td scope="row">EMPLOYMENT AGREEMENT</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt">0002.txt</a></td>
            <td scope="row">EX-10.6</td>
            <td scope="row">18708</td>
         </tr>
         <tr>
            <td scope="row">3</td>
            <td scope="row">CHANGE IN TERMS AGREEMENT</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt">0003.txt</a></td>
            <td scope="row">EX-10.9</td>
            <td scope="row">24380</td>
         </tr>
         <tr class="blueRow">
            <td scope="row">4</td>
            <td scope="row">FIRST AMENDMENT TO LEASE AGREEMENT</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt">0004.txt</a></td>
            <td scope="row">EX-10.12</td>
            <td scope="row">15945</td>
         </tr>
         <tr>
            <td scope="row">5</td>
            <td scope="row">THIRD AMENDMENT TO LEASE</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt">0005.txt</a></td>
            <td scope="row">EX-10.19</td>
            <td scope="row">3127</td>
         </tr>
         <tr class="blueRow">
            <td scope="row">6</td>
            <td scope="row">FOURTH AMENDMENT TO LEASE</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt">0006.txt</a></td>
            <td scope="row">EX-10.20</td>
            <td scope="row">3887</td>
         </tr>
         <tr>
            <td scope="row">7</td>
            <td scope="row">FIFTH AMENDMENT TO LEASE</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt">0007.txt</a></td>
            <td scope="row">EX-10.21</td>
            <td scope="row">3980</td>
         </tr>
         <tr class="blueRow">
            <td scope="row">8</td>
            <td scope="row">SIXTH AMENDMENT TO LEASE</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt">0008.txt</a></td>
            <td scope="row">EX-10.22</td>
            <td scope="row">4017</td>
         </tr>
         <tr>
            <td scope="row">9</td>
            <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
            <td scope="row">EX-21.1</td>
            <td scope="row">700</td>
         </tr>
         <tr class="blueRow">
            <td scope="row">10</td>
            <td scope="row">CONSENT OF INDEPENDENT PUBLIC ACCOUNTANTS</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt">0010.txt</a></td>
            <td scope="row">EX-23.1</td>
            <td scope="row">346</td>
         </tr>
         <tr>
            <td scope="row">&nbsp;</td>
            <td scope="row">Complete submission text file</td>
            <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt">0000927356-01-000365.txt</a></td>
            <td scope="row">&nbsp;</td>
            <td scope="row">272254</td>
         </tr>
      </table>
   </div>
   <!-- END DOCUMENT DIV -->
</div>
<!-- START FILER DIV -->
<div id="filerDiv">
   <div class="mailer">Mailing Address
      <span class="mailerAddress">13751 S WADSWORTH PARK DR SUITE D-140</span>
      <span class="mailerAddress"> DRAPER UT 84020 </span>
   </div>
   <div class="mailer">Business Address
      <span class="mailerAddress">13751 S WADSWORTH PARK DR SUITE D-140</span>
      <span class="mailerAddress"> DRAPER UT 84020 </span>
      <span class="mailerAddress">8015728225</span>
   </div>
   <div class="companyInfo">
      <span class="companyName">1 800 CONTACTS INC (Filer)
      <acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany">0001050122 (see all company filings)</a></span>
      <p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>870571643</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>10-K</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=000-23633&amp;action=getcompany"><strong>000-23633</strong></a> | Film No.: <strong>1587687</strong><br /><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include">3827</a></b> Optical Instruments &amp; Lenses<br />Assistant Director 10</p>
   </div>
   <div class="clear"></div>
</div>
<!-- END FILER DIV -->
</div>
</body>
</html>
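For reference, a minimal sketch of the $/ idea mentioned above, assuming several pages like this one are concatenated in a single input. The two tiny documents in $input are made up for illustration and are not the real data; setting $/ to "</html>" makes each read return one whole page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two small HTML documents concatenated in one string.
my $input = "<html><body>first</body></html>\n"
          . "<html><body>second</body></html>\n";

my @records;
{
    # Read from an in-memory filehandle so the sketch is self-contained.
    open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

    # Each read now ends at the closing tag instead of at a newline.
    local $/ = "</html>";

    while (my $record = <$fh>) {
        next unless $record =~ /\S/;   # skip the whitespace-only tail
        push @records, $record;
    }
    close $fh;
}

printf "%d records\n", scalar @records;
```

With a real file, the in-memory handle would be replaced by an ordinary open on the file name.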

Replies are listed 'Best First'.
Re^5: REGEX for url
by Marshall (Canon) on Apr 25, 2016 at 22:24 UTC
    I ran this code, essentially as suggested by james28909, against your data set. This approach has obvious flaws because it ignores the HTML structure: it picks up hrefs that you don't care about, and it can run past the closing quote of a link. A module to parse this would be better.

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Print one href per input line. The greedy .* patterns are kept as run,
    # which is why one result below runs past the closing quote.
    while (my $line = <DATA>)
    {
       (my $url) = $line =~ m/.*a href="(.*)".*/;
       next unless $url;
       print "$url\n";
    }

    =Prints
    javascript:history.back()
    http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
    /edgar/searchedgar/webusers.htm
    http://www.sec.gov/
    /edgar/searchedgar/webusers.htm
    /edgar/searchedgar/companysearch.html
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
    /cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany
    /cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include
    Process completed successfully
    =cut

    __DATA__
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>EDGAR Filing Documents for 0000927356-01-000365</title>
    <link rel="stylesheet" type="text/css" href="/include/interactive.css" />
    </head>
    <body style="margin: 0">
    <noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
    <!-- BEGIN BANNER -->
    .... abbreviated to reduce space.....
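If a regex is used anyway, the two flaws visible in the output above (the match that runs past the closing quote, and the hrefs nobody asked for) can be reduced with a non-greedy character class, a /g match, and a filter. This is only a sketch: the sample $html is a trimmed fragment of the page from this thread, and the assumption that only the /Archives/ ... .txt links matter is mine. A real HTML parsing module would still be sturdier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input: a trimmed fragment of the filing page quoted in this thread.
my $html = <<'END_HTML';
<div id="Nav"><a href="http://www.sec.gov/index.htm">Home</a> | <a href="javascript:history.back()">Previous Page</a></div>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">0001.txt</a></td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt">0000927356-01-000365.txt</a></td>
END_HTML

# [^"]* cannot run past the closing quote, and /g in list context
# returns every capture, not just one per line.
my @hrefs = $html =~ m/<a\s+href="([^"]*)"/g;

# Assumption: only the /Archives/ ... .txt documents are of interest.
my @docs = grep { m{^/Archives/.*\.txt$} } @hrefs;

print "$_\n" for @docs;
```

The pattern still assumes double-quoted href values with href as the first attribute of the tag, which holds for this page but not for HTML in general.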
