PerlMonks  

relevant keywords from a website

by pritchard12 (Initiate)
on Jul 13, 2009 at 11:47 UTC ( id://779541 )

pritchard12 has asked for the wisdom of the Perl Monks concerning the following question:

I am wondering what the best way to approach this problem is. I want to gather the most meaningful keywords from a website. Currently I parse the text out of an ordinary HTML page with the HTML::Parser module and use Lingua::EN::Fathom to keep track of word occurrences. Is this the best approach?

I want to find the relevant keywords so that I can categorize the page. For example, if the page is from espn.com and is about baseball, I could gather the keywords and, depending on how I set up the algorithm, have it determine which category fits best and label it a sports page. I don't need help with actually assigning the website to a category; I just need to know the best way to determine meaningful keywords from a website. Thanks.
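For what it's worth, the counting side of this can be sketched in a few lines of core Perl. The regex tag-stripping below is only a stand-in for HTML::Parser, to keep the sketch self-contained; it is nowhere near as robust against real-world markup:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count word occurrences in the visible text of an HTML page.
# Crude regex stripping stands in for HTML::Parser here.
sub word_frequencies {
    my ($html) = @_;
    $html =~ s/<script\b.*?<\/script>//gis;   # drop script blocks
    $html =~ s/<style\b.*?<\/style>//gis;     # drop style blocks
    $html =~ s/<[^>]*>/ /g;                   # strip remaining tags
    my %freq;
    $freq{lc $_}++ for $html =~ /([A-Za-z']+)/g;
    return \%freq;
}

my $freq = word_frequencies(
    '<html><body><p>Baseball scores: baseball wins!</p></body></html>'
);
print "$_: $freq->{$_}\n" for sort keys %$freq;
```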

Replies are listed 'Best First'.
Re: relevant keywords from a website
by moritz (Cardinal) on Jul 13, 2009 at 11:53 UTC
    My approach would be to somehow obtain a database of typical word frequencies in English text.

    Then the most interesting keywords are those that appear significantly more often than in "normal" English text.

    Of course you will need to experiment with cut-offs (for example considering only words that appear at least twice or so), filtering out names, etc.
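    A minimal sketch of that idea, assuming you have a baseline frequency table for "normal" English — the baseline numbers and the floor value below are made up purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical baseline: fraction of running English text each word
# accounts for. These numbers are invented for illustration.
my %baseline = ( the => 0.060, and => 0.028, baseball => 0.00002 );
my $floor    = 0.00001;   # assumed frequency for words not in the table

# Score words by how much more often they appear on this page than in
# ordinary English; skip words seen fewer than twice (a simple cut-off).
sub keyword_scores {
    my ($page_counts) = @_;
    my $total = 0;
    $total += $_ for values %$page_counts;
    my %score;
    for my $word (keys %$page_counts) {
        next if $page_counts->{$word} < 2;
        my $page_freq = $page_counts->{$word} / $total;
        $score{$word} = $page_freq / ($baseline{$word} // $floor);
    }
    return \%score;
}

my $scores = keyword_scores({ the => 10, baseball => 5, pitcher => 3, rare => 1 });
for my $w (sort { $scores->{$b} <=> $scores->{$a} } keys %$scores) {
    printf "%-10s %.1f\n", $w, $scores->{$w};
}
```

    Common words like "the" score near 1 and fall to the bottom; rare words that the page repeats rise to the top.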

Re: relevant keywords from a website
by dorward (Curate) on Jul 13, 2009 at 13:54 UTC
    I'm using WWW::Yahoo::KeywordExtractor for this. It shunts the work off to a third party, but if you don't mind the loss of control, it takes very little time to implement.
Re: relevant keywords from a website
by ig (Vicar) on Jul 13, 2009 at 12:18 UTC

    You might try applying Bayesian filters. There are various modules implementing such filters, many of them developed for detecting spam. If you don't want to use a filter operationally, you could train it and then review its word probabilities to decide which keywords are relevant.

    AI::Categorizer looks particularly interesting, though I haven't used it. Otherwise you can search CPAN for bayes to see what might suit your needs.
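    To illustrate the idea (this is not AI::Categorizer's API — just a toy ratio of smoothed per-category word probabilities, with invented training counts):

```perl
use strict;
use warnings;

# Invented training data: word counts per category.
my %counts = (
    sports => { baseball => 8, score => 5, the => 20 },
    news   => { election => 7, vote  => 4, the => 22 },
);

# Rank words by how strongly they indicate one category over the rest:
# P(word|category) with add-one smoothing, divided by P(word|other).
sub indicative_words {
    my ($cat) = @_;
    my (%in, %out);
    for my $c (keys %counts) {
        my $h = $c eq $cat ? \%in : \%out;
        $h->{$_} += $counts{$c}{$_} for keys %{ $counts{$c} };
    }
    my $in_total  = 0; $in_total  += $_ for values %in;
    my $out_total = 0; $out_total += $_ for values %out;
    my %score;
    for my $w (keys %in) {
        my $p_in  = ($in{$w} + 1)          / ($in_total  + 1);
        my $p_out = (($out{$w} // 0) + 1)  / ($out_total + 1);
        $score{$w} = $p_in / $p_out;
    }
    return \%score;
}

my $s = indicative_words('sports');
print "$_\n" for sort { $s->{$b} <=> $s->{$a} } keys %$s;
```

    Words that appear in both categories (like "the") score near 1, while category-specific words stand out — which is exactly the "review its probabilities" step described above.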

Re: relevant keywords from a website
by Gavin (Archbishop) on Jul 13, 2009 at 17:57 UTC

    A search of CPAN using "keywords" turns up a few modules that may help you in addition to the module suggested by dorward.

    Better still, why not simply parse the existing keywords from the meta "keywords" element in the page source?

    Someone has already taken the time to work out what the page is about and list the keywords!
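    A quick sketch of pulling those out with a regex — good enough for well-formed pages, though HTML::Parser is safer against unusual attribute order or quoting:

```perl
use strict;
use warnings;

# Extract the comma-separated values from <meta name="keywords" content="...">.
sub meta_keywords {
    my ($html) = @_;
    if ($html =~ /<meta[^>]*\bname\s*=\s*["']keywords["'][^>]*\bcontent\s*=\s*["']([^"']*)["']/is) {
        return map { s/^\s+|\s+$//gr } split /,/, $1;
    }
    return;
}

my @kw = meta_keywords(q{<meta name="keywords" content="baseball, scores, ESPN">});
print "$_\n" for @kw;   # baseball / scores / ESPN
```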

      I think it is a good idea to look at the metadata. The reason I asked for another method is simply that I don't trust everyone; on more amateur sites you would categorize more accurately by the content than by the metadata. But I am open to trying both ways and letting the results speak for themselves.
