PerlMonks  

Jiggy w/ LinkExtor

by amearse (Sexton)
on Aug 08, 2001 at 22:45 UTC ( [id://103164]=perlquestion )

amearse has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks,

I am working on a parser to grab all the unsubscribe links from a big text file. The text file is a mix of plain text and HTML. I am able to use HTML::LinkExtor to grab most of the links; however, at this point it returns both 'a href' and 'img src' links. I'm only interested in the 'a href' links, and once I have these, I would like to narrow them down with a regex.

As of now it looks like this:

#!/usr/bin/perl
use HTML::LinkExtor;
use URI::URL;

$p = HTML::LinkExtor->new(\&cb, "http://www.x10.com");

sub cb {
    my($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}

$p->parse_file("rfl.txt");

#@glob = $p;
#for($i=0; $i<@glob; $i++){
#    $_ = @glob[$i];
#    if(/account.cgi/){
#        $counter = 1 - $counter;
#        print $_ ;
#    }
#}

I plan to uncomment the regex portion when I get better results.

I know there are a lot of errors, and I appreciate any guidance. Incidentally, I can't use strict, because I get these errors when I do:

Global symbol "$p" requires explicit package name at link.pl line 9.
Global symbol "$p" requires explicit package name at link.pl line 14.
Execution of link.pl aborted due to compilation errors.

So my main objectives are to remove any 'img src' references, and to make sure that all the URLs are stored properly in an array which I can parse further.

Here is the top portion of my current results. I also noticed that some of the URLs are not returned or are incomplete.

a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/intelli%22
a href http://www.x10.com/3D%22http://www.consumerinfo.com/home_pca.asp?sc=3D141 =
a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/webpd%22
a href http://www.x10.com/3D%22http://www.x10.com/xcam2_allspecial33.htm%22
a href http://www.x10.com/3D%22http://www.teamnova.com/encore/combo.cfm?siteid=3 D=
a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/intel=
a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/intel=
img src http://www.x10.com
img src http://www.x10.com
a href http://www.x10.com/jecn@allaboutspe=
img src http://www.x10.com
img src http://www.x10.com
img src http://www.x10.com
a href http://www.x10.com/3D%22http://www.consumerinfo.com/home_pca.asp?sc=3D14=
img src http://www.x10.com/=
img src http://www.x10.com

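
The `3D%22` prefix and the trailing `=` on several of these URLs look like quoted-printable residue: in that encoding `=3D` stands for a literal `=`, and a `=` at the end of a line is a soft line break. That is only an inference from the output, since rfl.txt itself is not shown. If it holds, decoding the text before parsing should restore the broken links. A minimal sketch using MIME::QuotedPrint, with a made-up fragment standing in for the real file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical quoted-printable fragment (not taken from rfl.txt):
# "=3D" encodes "=", and the "=" ending the first line is a soft line break.
my $encoded = <<'QP';
<a href=3D"http://www.example.com/account.c=
gi?id=3D42">unsubscribe</a>
QP

# decode_qp rejoins soft-broken lines and turns =XX escapes back into bytes.
my $decoded = decode_qp($encoded);

print $decoded;
```

If the file mixes encoded and unencoded messages, decoding everything blindly could damage plain-text parts that happen to contain `=` followed by two hex digits, so it may be safer to decode only the sections known to be quoted-printable.
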
I appreciate any help you can give.

Bests,
amearse
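
The two objectives stated in the question (skip 'img src', keep the 'a href' URLs in an array for later filtering) can be sketched roughly as follows. This is one possible approach, not a tested fix for the original file: the callback name `collect_href`, the arrays `@links` and `@unsubscribe`, and the sample HTML are all made up for illustration, and HTML::LinkExtor and URI must be installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

my @links;    # every <a href="..."> URL found, as plain strings

# HTML::LinkExtor calls this once per link-bearing tag,
# passing the tag name and its link attributes as key/value pairs.
sub collect_href {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' && defined $attr{href};    # skips img, etc.
    push @links, "$attr{href}";    # stringify the URI object
}

# The base URL is the one used in the question; relative links
# are made absolute against it.
my $p = HTML::LinkExtor->new(\&collect_href, "http://www.x10.com");

# Hypothetical sample input, standing in for rfl.txt.
my $sample = <<'HTML';
<a href="http://www.example.com/account.cgi?id=1">unsubscribe</a>
<img src="/logo.gif">
<a href="/about.html">about</a>
HTML

$p->parse($sample);
$p->eof;

# Narrow the collected URLs down with a regex.
my @unsubscribe = grep { /account\.cgi/ } @links;

print "$_\n" for @unsubscribe;
```

To run it against the real data, the `parse`/`eof` pair would be replaced by `$p->parse_file("rfl.txt")`. Declaring `$p` with `my` is also what clears the "Global symbol" errors under strict.
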

Replies are listed 'Best First'.
Re: Jiggy w/ LinkExtor
by thatguy (Parson) on Aug 08, 2001 at 22:56 UTC
Re: Jiggy w/ LinkExtor
by one4k4 (Hermit) on Aug 08, 2001 at 22:54 UTC
    Let's try the script with the following added:
    #!/usr/bin/perl -w
    use strict;
    The -w switch turns on warnings, which will help clue you in to any errors Perl can find or warn you about. use strict follows the same concept. It's a general "first rule" when developing, or for all code in general. It's a big help.

    _14k4 - perlmonks@poorheart.com (www.poorheart.com)
      Hey one4k4, I tried -w with use strict; and got the same error. Could it be because I'm using ActiveState?
        Try using 'strict', '-w', and 'my' variables.
        my $p = "local, sort of";

        That is a local (lexically-scoped) variable.

        Under 'strict', perl won't let you use a global (package-scoped) variable unless you look like you really mean it. :)

        $main::p = "global, sort of";
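
The distinction drawn in this reply can be seen in a short script that compiles cleanly under strict. It is only an illustrative sketch; the variable names `$counter` and `$total` are made up, and `our` is an additional way to declare a package variable that the reply does not mention.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lexical variable: declared with my, visible only in the enclosing scope.
my $p = "lexical";

# Package variable with a fully qualified name: accepted by strict.
$main::counter = 0;

# Package variable declared with our: strict then accepts the short name.
our $total = 10;

$main::counter++;
$total += $main::counter;

print "$p $main::counter $total\n";
```

Removing the `my` from the first assignment brings back the "Global symbol "$p" requires explicit package name" compile error quoted in the question.
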
