read HTML <title> tag

AngusScrimm has asked for the wisdom of the Perl Monks concerning the following question:

Newbie here. I'm trying to get the data from just the <title> tag of an HTML page.

I have some Perl code (cobbled together from some online examples) that can read the data from an HTML file, and I have an example code snippet that is supposed to read just the <title> tag.

My problem is figuring out how to make the two pieces of code work together. Or maybe I'm going down the wrong path. Any advice would be appreciated.

Here's the code to read in all the data from the HTML file:

#!/usr/bin/perl -w

use strict;
package Example;
require HTML::Parser;

@Example::ISA = qw(HTML::Parser);

my $parser = Example->new;
$parser->parse_file('index2.html');
print $parser->{TEXT};

sub text
{
  my ($self,$text) = @_;
  $self->{TEXT} .= $text;
}

And here's the code snippet, listed on the CPAN page for HTML::Parser, for extracting just the <title> tag data:

sub start_handler
  {
    return if shift ne "title";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(end  => sub { shift->eof if shift eq "title"; },
                           "tagname,self");
  }

  my $p = HTML::Parser->new(api_version => 3);
  $p->handler( start => \&start_handler, "tagname,self");
  $p->parse_file(shift || die) || die $!;
  print "\n";


Replies are listed 'Best First'.
Re: read HTML <title> tag by Corion (Patriarch) on May 31, 2005 at 13:50 UTC
Most likely, you want to use HTML::HeadParser instead of using HTML::Parser yourself.

use strict;
use HTML::HeadParser;

my $p = HTML::HeadParser->new;
$p->parse_file('index2.html'); # and print "not finished";
print "Title is ", $p->header('Title');
Re: read HTML <title> tag by dbwiz (Curate) on May 31, 2005 at 14:06 UTC
HTML::Parser is not the easiest way of parsing HTML. Besides Corion's suggestion about HTML::HeadParser, if you need to parse more than the document's title, you may want to get acquainted with HTML::TokeParser. Here is a way of finding the title:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new('index.html') or die "can't open\n";
while (my $token = $p->get_token) {
    if ($token->[0] eq "S" and lc $token->[1] eq 'title') {
        my $title = $p->get_text() || "<NO TITLE FOUND>";
        print "$title\n";
        last;
    }
}
Re: read HTML <title> tag by jeffa (Bishop) on May 31, 2005 at 14:19 UTC
And to add one more shiny metal ball to suck the brains out of your oppressors (ref), how about HTML::TokeParser::Simple?

use strict;
use warnings;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new('index2.html');
while ( my $token = $p->get_token ) {
    next unless $token->is_tag('title');
    print $p->get_token->as_is;
    last;
}

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
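On the original question of making the two pieces work together: one possible approach is to drop the `Example` subclass and use only the version 3 handler interface from the CPAN snippet, collecting the title into a variable instead of printing it. The following is a minimal sketch under that assumption, not the only way to do it. The helper name `title_of_file` is hypothetical and does not come from the thread or from HTML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): returns the text of the
# first <title> element in $file, or undef if no title text is found.
sub title_of_file {
    my ($file) = @_;
    my $title;
    my $in_title = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start collecting once a <title> start tag is seen.
        start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
        # Append decoded text only while inside <title>.
        text_h  => [ sub { $title .= $_[0] if $in_title }, 'dtext' ],
        # Stop parsing as soon as </title> is reached.
        end_h   => [ sub {
                         my ($tagname, $self) = @_;
                         if ($tagname eq 'title') {
                             $in_title = 0;
                             $self->eof;
                         }
                     }, 'tagname,self' ],
    );

    # parse_file returns undef (with $! set) if the file can't be opened.
    $p->parse_file($file) or die "Can't parse $file: $!\n";
    return $title;
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    my $title = title_of_file($ARGV[0]);
    print defined $title ? "$title\n" : "<NO TITLE FOUND>\n";
}
```

Saved under an assumed name such as `title.pl`, this could be run as `perl title.pl index2.html`. Returning the title instead of printing it inside the handler keeps the parsing separate from whatever the caller wants to do with the result.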