
Hey,

I adjusted my code, and it now correctly extracts the URLs from the Reuters headlines.

However, when I print the URLs, they look like this:

http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512094§ion=news
http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512054§ion=news
http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512041§ion=news

If you copy and paste one of those URLs, it brings you to a blank Reuters template, partly because the end of the URL, which shows "§ion=news", should really be "&section=news".

Somehow it's translating "&section=news" into "§ion=news".

Could it be because I'm using the MIME-Base32 module and not MIME-Base64? I'm on a Windows machine.
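One possible explanation, offered as a guess rather than a confirmed diagnosis: the script prints with "Content-type: text/html", so a browser displaying the output may read "&sect" as the HTML entity for the section sign (§). A minimal sketch, assuming that is the cause, escapes the ampersands with HTML::Entities before printing. The URL below is a made-up example in the same shape as the ones above, not one taken from the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical URL in the same shape as the ones printed above.
my $url = 'http://www.reuters.com/newsArticle.jhtml'
        . '?type=businessNews&storyID=4512094&section=news';

# Escape <, >, & and " so a browser shows the text literally
# instead of treating "&sect" as a character entity.
my $safe = encode_entities($url, '<>&"');

print "Content-type: text/html\n\n";
print "$safe<br />\n";
```

If the output is only ever read as plain text, printing "Content-type: text/plain" instead would be another way to avoid entity translation.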

Adjusted code:

#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;
use LWP::Simple;
use URI;
print "Content-type: text/html\n\n";

my $filename = 'temp.html';

open FH, ">$filename";
print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
close FH;

my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";

while (my $token = $stream->get_token) {

    if ($token->[0] eq 'S' and $token->[1] eq 'td' and
       ($token->[2]{'class'} || '') eq 'earlyHeadline') {

        my (@next) = ($stream->get_token);

        if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a'
            and defined $next[0][2]{'href'}) {
            # early headline found for business section; grab the href portion
            print URI->new_abs($next[0][2]{'href'}, 'http://www.reuters.com/'), "\n";
        }

    }

}

Thank you,
Anthony

In reply to Re: Re: HTML::TokeParser help - parsing headlines by perleager
in thread HTML::TokeParser help - parsing headlines by perleager
