PerlMonks  

read HTML <title> tag

by AngusScrimm (Initiate)
on May 31, 2005 at 13:46 UTC ( [id://462051]=perlquestion )

AngusScrimm has asked for the wisdom of the Perl Monks concerning the following question:

Newbie here. I'm trying to get the data from just the <title> tag of an HTML page.

I have some Perl code (cobbled together from some online examples) that can read the data from an HTML file, and I have an example code snippet that is supposed to read just the <title> tag.

My problem is figuring out how to make the two pieces of code work together. Or maybe I'm going down the wrong path. Any advice would be appreciated.

Here's the code to read in all the data from the HTML file:

    #!/usr/bin/perl -w
    use strict;

    package Example;
    require HTML::Parser;
    @Example::ISA = qw(HTML::Parser);

    my $parser = Example->new;
    $parser->parse_file('index2.html');
    print $parser->{TEXT};

    # Called by the parser for every run of text; accumulate all of it.
    sub text {
        my ($self, $text) = @_;
        $self->{TEXT} .= $text;
    }

And here's the code snippet, listed on the CPAN page for HTML::Parser, for extracting just the <title> tag data:

    use HTML::Parser ();

    sub start_handler {
        return if shift ne "title";
        my $self = shift;
        # Once inside <title>, print its text and stop at the closing tag.
        $self->handler(text => sub { print shift }, "dtext");
        $self->handler(end  => sub { shift->eof if shift eq "title"; },
                       "tagname,self");
    }

    my $p = HTML::Parser->new(api_version => 3);
    $p->handler(start => \&start_handler, "tagname,self");
    $p->parse_file(shift || die) || die $!;
    print "\n";
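
For reference, one way the two pieces might be combined is sketched below. It keeps the handler style of the CPAN snippet but collects the title text into a variable, as the first script does with $self->{TEXT}, instead of printing it. The helper name title_from_html and the $in_title flag are made up for illustration and are not part of HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the decoded text of the first <title>
# element in an HTML string, or undef if there is none.
sub title_from_html {
    my ($html) = @_;
    my $title;
    my $in_title = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        # Note when we enter <title>.
        start_h => [ sub { $in_title = 1 if shift eq 'title' }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h  => [ sub { $title .= shift if $in_title }, 'dtext' ],
        # Stop parsing as soon as </title> is seen.
        end_h   => [ sub {
            my ($tag, $self) = @_;
            $self->eof if $tag eq 'title';
        }, 'tagname,self' ],
    );

    # parse() returns false when a handler aborts via eof, so its
    # return value is deliberately not checked here.
    $p->parse($html);
    $p->eof;
    return $title;
}
```

To read from a file as in the original script, the same handlers should work with $p->parse_file('index2.html') in place of the parse and eof calls.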

Replies are listed 'Best First'.
Re: read HTML <title> tag
by Corion (Patriarch) on May 31, 2005 at 13:50 UTC

    Most likely, you want to use HTML::HeadParser instead of using HTML::Parser yourself.

        use strict;
        use HTML::HeadParser;

        my $p = HTML::HeadParser->new;
        $p->parse_file('index2.html'); # and print "not finished";
        print "Title is ", $p->header('Title');
Re: read HTML <title> tag
by dbwiz (Curate) on May 31, 2005 at 14:06 UTC

    HTML::Parser is not the easiest way of parsing HTML.

    Besides Corion's suggestion about HTML::HeadParser, if you need to parse more than the document's title, you may want to get acquainted with HTML::TokeParser. Here is a way of finding the title:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TokeParser;

        my $p = HTML::TokeParser->new('index.html')
            or die "can't open\n";

        while (my $token = $p->get_token) {
            # Start tags are tokens of the form ["S", $tagname, ...]
            if ($token->[0] eq "S" and lc $token->[1] eq 'title') {
                my $title = $p->get_text() || "<NO TITLE FOUND>";
                print "$title\n";
                last;
            }
        }
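
    A shorter variant of the same idea, as a sketch: HTML::TokeParser also provides get_tag, which skips ahead to the next tag of a given name, and get_trimmed_text, which returns the following text with whitespace collapsed. The helper name title_of is made up for illustration.

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser ();

    # Hypothetical helper: returns the trimmed <title> text, or undef
    # if the document has no <title>. $source may be a file name, a
    # file handle, or a reference to a string holding the HTML.
    sub title_of {
        my ($source) = @_;
        my $p = HTML::TokeParser->new($source)
            or die "Can't open $source: $!";
        # get_tag returns false when no further <title> tag exists.
        $p->get_tag('title') or return;
        return $p->get_trimmed_text;
    }
    ```

    With a file on disk this could be called as print title_of('index.html'), "\n";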
Re: read HTML <title> tag
by jeffa (Bishop) on May 31, 2005 at 14:19 UTC

    And to add one more shiny metal ball to suck the brains out of your oppressors (ref), how about HTML::TokeParser::Simple?

        use strict;
        use warnings;
        use HTML::TokeParser::Simple;

        my $p = HTML::TokeParser::Simple->new('index2.html');

        while ( my $token = $p->get_token ) {
            next unless $token->is_tag('title');
            # The token right after <title> is the title text.
            print $p->get_token->as_is;
            last;
        }

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    