PerlMonks  

Re^5: REGEX for url

by Marshall (Canon)
on Apr 25, 2016 at 22:24 UTC ( [id://1161490]=note )


in reply to Re^4: REGEX for url
in thread REGEX for url

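I ran this code, essentially as suggested by james28909, against your data set. The approach has obvious flaws because the regex knows nothing about HTML structure: it also picks up hrefs that you don't care about (javascript:history.back(), for example), and the greedy .* can run past the closing quote of the href, as the http://www.sec.gov/index.htm entry in the output shows. A module that actually parses the HTML would be better.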

#!/usr/bin/perl
use warnings;
use strict;

# Capture an href value from each line; lines without a match are skipped.
while (my $line = <DATA>)
{
   # Both .* are greedy: the match uses the last 'a href="' on the line
   # and the capture runs to the last double quote on that line.
   (my $url) = $line =~ m/.*a href="(.*)".*/;
   next unless $url;
   print "$url\n";
}

=Prints
javascript:history.back()
http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
/edgar/searchedgar/webusers.htm
http://www.sec.gov/
/edgar/searchedgar/webusers.htm
/edgar/searchedgar/companysearch.html
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
/cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany
/cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include
Process completed successfully
=cut

__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. 
To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
.... abbreviated to reduce space.....
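
As a rough sketch of the module route, something along these lines should work. It assumes HTML::LinkExtor (part of the HTML::Parser distribution on CPAN, not core Perl) is installed. The sample HTML is a hypothetical cut-down modeled on the page above, not the real data set, and the /Archives/ filter is only my guess at which hrefs you actually care about.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;    # assumption: HTML::Parser distribution is installed

# Hypothetical sample modeled on the page in this thread.
my $html = <<'HTML';
<a href="javascript:history.back()">Back</a>
<a href="http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a>
<a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">doc 1</a>
<a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt">full submission</a>
HTML

# With no callback, the parser accumulates links for a later ->links call.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

my @archive_urls;
for my $link ($parser->links) {
    # Each entry is [ tag_name, attribute => url, ... ].
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};    # skip img, link, etc.
    next unless $attr{href} =~ m{^/Archives/};         # assumed filter
    push @archive_urls, $attr{href};
}

print "$_\n" for @archive_urls;
```

Because the parser returns attribute values rather than raw text, the img tag nested inside the sec.gov anchor no longer leaks into the URL, and the tag check drops everything that is not an anchor.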
