Having HTML::Parser problem

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm using code basically ripped from Lincoln Stein's book to get a list of links on web pages. Here's the meat:

sub start {
    my ($parser,$tag,$attr) = @_;
    
    $parser->{last_tag} = $tag;
    return unless $tag eq 'a';
    $parser->{attr} = $attr->{href};
    $parser->handler(text => \&extract, 'self,attr,dtext');
    $parser->handler(end => \&end, 'self,tagname');
}

sub end {
    my ($parser,$tag) = @_;
    undef $parser->{last_tag};
    return unless $tag eq 'a';
    $parser->handler(text => undef);
    $parser->handler(end => undef);
}

sub extract {
    my ($parser,$attr,$text) = @_;
    if ($parser->{last_tag} eq 'a') {
        if ($parser->{attr} && $text && $text !~ /^\s*$/) {
            $text =~ s/\n//g;
            $parser->{attr} =~ s/\n//g;
            push @array, $text;
            push @array, $parser->{attr};
        }
    }
}

It seemed to work beautifully until it choked on the following bit of html which is all on one line:

<font size=+1><b><A HREF="/dailyglobe2/142/metro/Plan_adopts_Romney_s_ideas_on_higher_education_restructuring+.shtml">Plan adopts Romney's ideas on higher education restructuring</a></b></font><br>

For some reason, the parser object is reading the single <a> tag in the code above as two <a> tags. It says the first tag contains the text "Plan adopts Romney's" and the second tag contains "ideas on higher education".

I'm stumped. Like I said, the parser seems to work on every other html hyperlink it finds. And it's not that plus sign in the link because the parser works on other links with the plus sign.

Does anyone see why the parser is seeing two links here? Thanks.

$PM = "Perl Monk's";
$MCF = "Most Clueless ~~Friar~~ ~~Abbot~~ Bishop";
$nysus = $PM . $MCF;
Click here if you love Perl Monks


Replies are listed 'Best First'.

Re: Having HTML::Parser problem
by pfaut (Priest) on May 23, 2003 at 01:30 UTC

From the HTML::Parser docs:

$p->unbroken_text( $bool )

By default, blocks of text are given to the text handler as soon as possible (but the parser makes sure to always break text at the boundary between whitespace and non-whitespace so single words and entities always can be decoded safely). This might create breaks that make it hard to do transformations on the text. When this attribute is enabled, blocks of text are always reported in one piece. This will delay the text event until the following (non-text) event has been recognized by the parser.

Note that the offset argspec will give you the offset of the first segment of text and length is the combined length of the segments. Since there might be ignored tags in between, these numbers can't be used to directly index in the original document file.
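
In other words, the parser is free to hand the text of one link to your handler in several pieces, which would explain a single <a> showing up as two. Here is a minimal sketch of one way around it, assuming the page reaches the parser in chunks (as it would from a network callback). It turns on unbroken_text and, since markup inside a link such as <b> still splits the text into separate events, it also accumulates text until the closing </a>. The name links_from_chunks and the shortened href are made up for illustration; this is not the code from the book.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return a list of [link text, href] pairs, feeding the parser the
# given chunks one at a time, the way a network callback would.
# (links_from_chunks is a hypothetical name, not part of HTML::Parser.)
sub links_from_chunks {
    my @chunks = @_;
    my (@links, $href, $text);

    my $p = HTML::Parser->new(api_version => 3);
    $p->unbroken_text(1);    # report each run of text in one piece

    # Remember the href when a link opens. Tag and attribute names
    # arrive lowercased, so <A HREF=...> is seen as 'a' and 'href'.
    $p->handler(start => sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'a' && defined $attr->{href};
        ($href, $text) = ($attr->{href}, '');
    }, 'tagname,attr');

    # Text inside a link can still arrive as several events if the link
    # contains markup, so accumulate it until the closing </a>.
    $p->handler(text => sub {
        $text .= $_[0] if defined $href;
    }, 'dtext');

    # On </a>, tidy the whitespace and store the pair.
    $p->handler(end => sub {
        my ($tag) = @_;
        return unless $tag eq 'a' && defined $href;
        $text =~ s/\s+/ /g;
        push @links, [ $text, $href ] if $text =~ /\S/;
        undef $href;
    }, 'tagname');

    $p->parse($_) for @chunks;
    $p->eof;
    return @links;
}

# The chunk boundary falls in the middle of the link text on purpose.
my @links = links_from_chunks(
    q{<b><A HREF="/metro/plan.shtml">Plan adopts Romney's },
    q{ideas on higher education restructuring</a></b><br>},
);
print "$_->[0] => $_->[1]\n" for @links;
```

Returning the pairs from the sub instead of pushing onto a global @array also keeps the text and the href together, so a skipped link cannot leave the two out of step.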

90% of every Perl application is already written. ⇒ dragonchild


Re: Re: Having HTML::Parser problem
by nysus (Parson) on May 23, 2003 at 01:47 UTC


Re: Having HTML::Parser problem
by hossman (Prior) on May 23, 2003 at 06:50 UTC

You might want to consider scrapping all of this code and using the HTML::LinkExtractor facade API (which is designed to simplify all of this for you).


Re: Having HTML::Parser problem
by hacker (Priest) on May 24, 2003 at 13:23 UTC

HTML::LinkExtor

HTML::LinkExtractor

HTML::SimpleLinkExtor

Obligatory code (from one of my production projects):

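use HTML::SimpleLinkExtor;

sub extract_links {
        my $content = shift;

        my $extor = HTML::SimpleLinkExtor->new();
        $extor->parse($content);

        # %seen is shared, so a URL that appears as both a link and an
        # image source is only kept in @hrefs.
        my %seen;
        my @hrefs       = grep { !$seen{$_}++} $extor->a;
        my @imgs        = grep { !$seen{$_}++} $extor->img;
        my @fragments   = grep {/#/} @hrefs;
        # make_canon() is a helper that is not shown here.
        my @frag        = map make_canon($_), @fragments;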

        # Other magical stuff under here to turn relative
        # links from @imgs and @hrefs into absolute, using
        # the URI module, and some other hocus pocus.
}
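
For the relative-to-absolute step mentioned in the comment, a minimal sketch with the URI module might look like the following. This is only a guess at the general shape, not the production code referred to above; absolutize and the example.com base URL are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Resolve each link against the URL of the page it was found on.
# (absolutize is a hypothetical helper name.)
sub absolutize {
    my ($base, @links) = @_;
    return map { URI->new_abs($_, $base)->as_string } @links;
}

# Made-up page URL; in practice this is wherever $content was fetched from.
my $base = 'http://www.example.com/news/index.html';

print "$_\n" for absolutize(
    $base,
    '/dailyglobe2/142/metro/story.shtml',   # absolute path on the same host
    'images/logo.gif',                      # relative to /news/
    'http://other.example.org/',            # already absolute, left alone
);
```

URI->new_abs leaves links that are already absolute untouched, so the whole list can be passed through without filtering first.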


Re: Having HTML::Parser problem
by nysus (Parson) on May 23, 2003 at 01:28 UTC


