pritchard12 has asked for the wisdom of the Perl Monks concerning the following question:
I am wondering what the best way to approach this problem is. I want to gather the most meaningful keywords from a website. I'm currently parsing the text out of an ordinary HTML page from the net with the HTML::Parser module, and using the Lingua::EN::Fathom module to keep track of word occurrences. Is this the best way to approach the problem? I want to find the relevant keywords so that I can categorize the page. For example, if the page is from espn.com and about baseball, I could gather the keywords and, depending on how I set up the algorithm, have it determine which category fits best and label it as a sports page. I don't need help with actually assigning the website to a category; I just need to find out the best way to determine meaningful keywords from a website. Thanks.
Re: relevant keywords from a website
by moritz (Cardinal) on Jul 13, 2009 at 11:53 UTC
My approach would be to somehow obtain a database of typical word frequencies in English text.
Then the most interesting keywords are those that appear significantly more often than in "normal" English text.
Of course you will need to experiment with cut-offs (for example, considering only words that appear at least twice or so), filtering out names, etc.
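A minimal sketch of that idea in core Perl, assuming you have a hash of baseline frequencies from some corpus (the numbers below are made up for illustration): count the words on the page, then rank each word by how much more often it appears than the baseline predicts, keeping only words seen at least twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical baseline: relative frequency of each word in "normal"
# English text. In practice you would load real numbers from a corpus.
my %baseline = (
    the => 0.070, and => 0.028, game => 0.0001, baseball => 0.00002,
);
my $default_freq = 0.00001;   # assumed floor for words missing from the baseline

# Count word occurrences in the page text (already stripped of markup).
my $text = "baseball game tonight: the home team won the baseball game";
my %count;
$count{ lc $1 }++ while $text =~ /([a-z']+)/gi;
my $total = 0;
$total += $_ for values %count;

# Score = observed frequency / expected frequency. Keep words seen at
# least twice, as suggested above, and sort by how "surprising" they are.
my @keywords =
    sort { $b->[1] <=> $a->[1] }
    map  { [ $_, ($count{$_} / $total) / ($baseline{$_} // $default_freq) ] }
    grep { $count{$_} >= 2 }
    keys %count;

printf "%-10s %10.1f\n", @$_ for @keywords;
```

With this toy input, "baseball" and "game" score far above "the", even though "the" occurs just as often, because the baseline says "the" is expected everywhere.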
Re: relevant keywords from a website
by dorward (Curate) on Jul 13, 2009 at 13:54 UTC
I'm using WWW::Yahoo::KeywordExtractor for this. It shunts the work off to a third party, but if you don't mind the loss of control, it takes very little time to implement.
Re: relevant keywords from a website
by ig (Vicar) on Jul 13, 2009 at 12:18 UTC
You might try applying Bayesian filters. There are various modules implementing such filters, many of them developed for detecting spam. If you don't want to use the filters operationally, you could train one and then review its word probabilities to decide which keywords are relevant.
AI::Categorizer looks particularly interesting, though I haven't used it. Otherwise, you can search CPAN for "bayes" to see what might suit your needs.
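To make the idea concrete, here is a toy naive Bayes classifier in core Perl (not the AI::Categorizer API; the training data and categories are hypothetical). It counts words per category, then scores a document by summed log-probabilities with add-one smoothing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy training data (hypothetical): one blob of text per category.
my %train = (
    sports  => "baseball pitcher inning home run team win",
    finance => "stock market shares price trade win",
);

# Count words per category.
my (%count, %total);
for my $cat (keys %train) {
    for my $w (split ' ', $train{$cat}) {
        $count{$cat}{$w}++;
        $total{$cat}++;
    }
}

# Pick the category with the highest log-probability for the document.
# Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
sub classify {
    my @words = split ' ', lc shift;
    my ($best, $best_score);
    for my $cat (keys %count) {
        my $vocab = keys %{ $count{$cat} };   # vocabulary size, scalar context
        my $score = 0;
        $score += log( (($count{$cat}{$_} // 0) + 1) / ($total{$cat} + $vocab) )
            for @words;
        ($best, $best_score) = ($cat, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

print classify("the pitcher threw a home run"), "\n";   # sports
```

As ig says, even if you never deploy the classifier, the per-word counts in `%count` tell you which words most strongly distinguish one category from another.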
Re: relevant keywords from a website
by Gavin (Archbishop) on Jul 13, 2009 at 17:57 UTC
A search of CPAN for "keywords" turns up a few modules that may help you, in addition to the module dorward suggested.
Better still, why not simply parse the keywords already present in the page's <meta name="keywords" content="..."> element in the source code? Someone has already taken the time to sum up what the page is all about and list it!
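A quick-and-dirty sketch of that in core Perl (for real pages you would use the HTML::Parser module the OP already has, since attribute order and quoting vary in the wild; this regex assumes double-quoted attributes in the usual order):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample page (hypothetical content).
my $html = <<'HTML';
<html><head>
<meta name="keywords" content="baseball, ESPN, sports, scores">
</head><body>...</body></html>
HTML

# Pull out the content attribute of the keywords meta tag and split
# the comma-separated list into individual keywords.
my @keywords;
if ($html =~ /<meta\s+name="keywords"\s+content="([^"]*)"/i) {
    @keywords = split /\s*,\s*/, $1;
}
print "$_\n" for @keywords;
```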
I think it is a good idea to look at the metadata. The reason I ask for another method is simply that I don't trust everyone, and I think that on more amateur sites you would categorize the page more accurately by its content than by its metadata. But I am open to trying both ways and letting the results speak for themselves.