http://qs321.pair.com?node_id=465382

cajun has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to grab the 'keywords' from the header of a web page. From what I read in Perl & LWP, I thought '$response->header('keywords')' would grab this for me. So far it has not worked, and I have not been able to figure out why.

I've looked at the docs for LWP, LWP::Simple, LWP::UserAgent, lwpcook, just to name a few.

Thanks for any suggestions.

#!/usr/bin/perl -w
use strict;
use LWP;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new();
$browser->env_proxy();    # if we're behind a firewall

my $url      = 'http://www.somewebsite.com/';
my $response = $browser->get($url);

die "Error \"", $response->status_line(), "\" when getting $url"
    unless $response->is_success();

my $keywords = $response->header('keywords');
print $keywords;

Replies are listed 'Best First'.
Re: WWW Keywords
by atcroft (Abbot) on Jun 10, 2005 at 02:03 UTC

    It sounds as if you are confusing two meanings of "header". You seem to mean the content between the <head> and </head> tags; in LWP, "header" means the HTTP message headers, which the browser interprets before the page content is processed. Try looking at one of the methods for parsing a standard web page, paying attention to the content within the HEAD tags.

    Hope that helps.
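
To make the distinction concrete, here is a minimal sketch that builds an HTTP::Response by hand, so nothing is fetched over the network. The page body and its keyword values are made up for illustration; only the HTTP::Response calls come from the library.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical page body: the keywords live in the HTML, not in HTTP.
my $html = <<'HTML';
<html><head>
<title>Example</title>
<meta name="keywords" content="perl, lwp, example">
</head><body>Hello</body></html>
HTML

# Hand-built response standing in for what $browser->get($url) returns.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html' ],
    $html,
);

# HTTP message headers are what $response->header() looks at.
print "Content-Type: ", $response->header('Content-Type'), "\n";

# No HTTP header named "keywords" was sent, so this comes back undef.
my $keywords = $response->header('keywords');
print "keywords header: ", ( defined $keywords ? $keywords : '(undef)' ), "\n";

# The meta tag is present only inside the body text.
print "body contains a keywords meta tag\n"
    if $response->content =~ /<meta\s+name="keywords"/i;
```

The undefined value is also why the original script prints nothing useful: under -w, printing it should produce a "Use of uninitialized value" warning.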

Re: WWW Keywords
by kaif (Friar) on Jun 10, 2005 at 04:46 UTC

    Here's a solution using HTML::HeadParser:

    # Just to get the content
    use LWP::Simple;
    $html = get("http://www.perlmonks.org/");

    # To parse the HTML header
    use HTML::HeadParser;
    $p = HTML::HeadParser->new;
    $p->parse($html);
    $keywords = $p->header( 'X-Meta-Keywords' );
    print "$keywords\n";

    __END__
    perl, mod_perl, regular expressions, regexp, xp whoring, CGI, programming, learning, tutorials, questions, answers, examples, vroom, tim, node, experience, votes, code

    Interesting keywords there ...
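
A strict-safe variant of the same idea, as a sketch: it parses a made-up HTML string instead of fetching a page, and splits the comma-separated keywords into a list. The sample HTML and the variable names are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical document; in practice this would be the fetched page content.
my $html = <<'HTML';
<html><head>
<title>Example page</title>
<meta name="keywords" content="perl, lwp, head parser">
<meta name="description" content="A made-up page">
</head><body><p>Body text</p></body></html>
HTML

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# <title> is exposed as the Title header; <meta name="foo"> as X-Meta-Foo.
my $title = $parser->header('Title');
my $raw   = $parser->header('X-Meta-Keywords');

# Guard against pages without a keywords meta tag before splitting.
my @keywords = defined $raw ? split( /\s*,\s*/, $raw ) : ();

print "Title: $title\n";
print "Keyword: $_\n" for @keywords;
```

As far as I know, LWP::UserAgent's parse_head option (on by default) runs the same head parsing on HTML responses, so $response->header('X-Meta-Keywords') may already work in the original script. Treat that as something to check against your installed version.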

Re: WWW Keywords
by davidrw (Prior) on Jun 10, 2005 at 02:08 UTC
    My first thought was for you to print $response->as_string; to see exactly what the HTTP headers look like. But after reading the first response, you're probably after the HTML head elements (meta tags, etc. inside the <head></head> tags). In that case, take a look at HTML::Parser, specifically the EXAMPLES section, where it extracts the <title> text.
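
For the HTML::Parser route, a rough sketch along these lines should work. It registers a start-tag handler and collects every <meta name=... content=...> pair into a hash. The sample HTML and the %meta hash are hypothetical, and this is not the example from the HTML::Parser docs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical document standing in for a fetched page.
my $html = <<'HTML';
<html><head>
<title>Example page</title>
<meta name="keywords" content="perl, lwp, html parser">
<meta name="author" content="someone">
</head><body><p>Body text</p></body></html>
HTML

my %meta;    # lowercased meta name => content

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ( $tagname, $attr ) = @_;
            return unless $tagname eq 'meta';
            # Skip meta tags such as http-equiv that lack name/content.
            return unless defined $attr->{name} && defined $attr->{content};
            $meta{ lc $attr->{name} } = $attr->{content};
        },
        'tagname,attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of document so buffered text is flushed

print "keywords: $meta{keywords}\n" if exists $meta{keywords};
```

This is more code than HTML::HeadParser needs for the same job, but the handler approach extends to tags outside <head> as well.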
Re: WWW Keywords
by cajun (Chaplain) on Jun 10, 2005 at 03:19 UTC
    Thanks atcroft and davidrw. You are both correct. I am looking for the information between <head> and </head>. Yes, I was confused on the 'Headers' vs 'HEAD' issue.

    I'm looking at the docs for HTML::Parser and HTTP::Headers, which seem to be related.

    Thanks for pointing me in the correct direction.