PerlMonks  

Having HTML::Parser problem

by nysus (Parson)
on May 23, 2003 at 01:23 UTC ( [id://260298]=perlquestion )

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm using code basically ripped from Lincoln Stein's book to get a list of links on web pages. Here's the meat:
sub start {
    my ($parser, $tag, $attr) = @_;
    $parser->{last_tag} = $tag;
    return unless $tag eq 'a';
    $parser->{attr} = $attr->{href};
    $parser->handler(text => \&extract, 'self,attr,dtext');
    $parser->handler(end  => \&end,     'self,tagname');
}

sub end {
    my ($parser, $tag) = @_;
    undef $parser->{last_tag};
    return unless $tag eq 'a';
    $parser->handler(text => undef);
    $parser->handler(end  => undef);
}

sub extract {
    my ($parser, $attr, $text) = @_;
    if ($parser->{last_tag} eq 'a') {
        if ($parser->{attr} && $text && $text !~ /^\s*$/) {
            $text =~ s/\n*//g;
            $parser->{attr} =~ s/\n*//g;
            push @array, $text;
            push @array, $parser->{attr};
        }
    }
}

It seemed to work beautifully until it choked on the following bit of html which is all on one line:
<font size=+1><b><A HREF="/dailyglobe2/142/metro/Plan_adopts_Romney_s_ideas_on_higher_education_restructuring+.shtml">Plan adopts Romney's ideas on higher education restructuring</a></b></font><br>

For some reason, the parser object is reading the single <a> tag in the code above as two <a> tags. It says the first tag contains the text "Plan adopts Romney's" and the second tag contains "ideas on higher education".

I'm stumped. Like I said, the parser seems to work on every other html hyperlink it finds. And it's not that plus sign in the link because the parser works on other links with the plus sign.

Does anyone see why the parser is seeing two links here? Thanks.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop";
$nysus = $PM . $MCF;
Click here if you love Perl Monks

Replies are listed 'Best First'.
Re: Having HTML::Parser problem
by pfaut (Priest) on May 23, 2003 at 01:30 UTC

    From the HTML::Parser docs:

    $p->unbroken_text( $bool )

    By default, blocks of text are given to the text handler as soon as possible (but the parser makes sure to always break text at the boundary between whitespace and non-whitespace so single words and entities always can be decoded safely). This might create breaks that make it hard to do transformations on the text. When this attribute is enabled, blocks of text are always reported in one piece. This will delay the text event until the following (non-text) event has been recognized by the parser.

    Note that the offset argspec will give you the offset of the first segment of text and length is the combined length of the segments. Since there might be ignored tags in between, these numbers can't be used to directly index in the original document file.
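A minimal sketch of what enabling this attribute looks like, assuming the HTML is fed to the parser in more than one chunk (which is when text inside a single tag can arrive as several text events). The function name link_text_events and the sample URL in the usage note are made up for illustration; unbroken_text, handler, parse and eof are the HTML::Parser methods.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns one [href, text] pair per text event
# seen inside an <a href=...> element.
sub link_text_events {
    my ($chunks, $unbroken) = @_;
    my (@events, $href);

    my $p = HTML::Parser->new(api_version => 3);

    # With this on, adjacent text segments are reported in one piece.
    $p->unbroken_text($unbroken ? 1 : 0);

    $p->handler(start => sub {
        my ($tag, $attr) = @_;
        $href = $attr->{href} if $tag eq 'a';
    }, 'tagname,attr');

    $p->handler(text => sub {
        my ($text) = @_;
        push @events, [ $href, $text ] if defined $href && $text =~ /\S/;
    }, 'dtext');

    $p->handler(end => sub {
        my ($tag) = @_;
        undef $href if $tag eq 'a';
    }, 'tagname');

    # Feed the document piece by piece, then signal end of input.
    $p->parse($_) for @$chunks;
    $p->eof;

    return @events;
}
```

Called as link_text_events([ '<A HREF="/x.shtml">Plan adopts Romney\'s ', 'ideas on higher education restructuring</a>' ], 1), this should report the link text as a single event even though the text was split across two parse() calls.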

    90% of every Perl application is already written.
    dragonchild
      That did the trick. I will have to bone up on this module because I certainly don't quite know how or why. Beautiful, thanks.


Re: Having HTML::Parser problem
by hossman (Prior) on May 23, 2003 at 06:50 UTC

    You might want to consider scrapping all of this code and using the HTML::LinkExtractor facade API (which is designed to simplify all of this for you).

      I used to be a user of HTML::LinkExtor (I've never personally used HTML::LinkExtractor), but I've recently found HTML::SimpleLinkExtor, which works beautifully, and in less code. If you have to install a CPAN module, why not use one that gets your job done faster and easier!

      Obligatory code (from one of my production projects):

      use HTML::SimpleLinkExtor;

      sub extract_links {
          my $content = shift;
          my $extor = HTML::SimpleLinkExtor->new();
          $extor->parse($content);

          my %seen;
          my @hrefs = grep { !$seen{$_}++ } $extor->a;
          my @imgs  = grep { !$seen{$_}++ } $extor->img;

          # make_canon() is the poster's own helper (not shown here)
          my @fragments = grep { /#/ } @hrefs;
          my @frag = map make_canon($_), @fragments;

          # Other magical stuff under here to turn relative
          # links from @imgs and @hrefs into absolute, using
          # the URI module, and some other hocus pocus.
      }
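The "turn relative links into absolute" step mentioned in the last comment is not shown in the post. One possible way to do it with the URI module is sketched below; the function name absolutize and the base URL in the usage note are assumptions for illustration, not the poster's code.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve each link against $base and
# return the unique absolute URLs in first-seen order.
sub absolutize {
    my ($base, @links) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { URI->new_abs($_, $base)->canonical->as_string }
           @links;
}
```

For example, absolutize('http://www.example.com/news/index.html', '/a/b.shtml', 'c.html') resolves the first link against the site root and the second against the /news/ directory.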
Re: Having HTML::Parser problem
by nysus (Parson) on May 23, 2003 at 01:28 UTC
    Slight correction to above: the parser says the second tag contains "ideas on higher education restructuring".


Node Type: perlquestion [id://260298]
Approved by cciulla