Natural Language Index Stemming

rob_au has asked for the wisdom of the Perl Monks concerning the following question:

I am curious about the experience of others with natural language stemming for site indexes. I ask because I am in the process of rewriting a site search engine (to improve maintainability and to fit the corporate application environment) and have come across a number of discussions regarding natural language stemming in this type of application.

For those unfamiliar with this concept, stemming is the process of reducing a word to its stem or root form. This allows similar words such as computer and computing to be conflated to a single root (for example, comput), thereby reducing index dictionary size and, in theory, storage requirements and processing time. A further discussion on this concept can be found here.
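As a rough illustration of conflation, the sketch below uses a toy suffix stripper and counts how many dictionary keys remain after stemming. The `toy_stem` function and its suffix list are assumptions made up for this example; this is not the Porter algorithm or any published stemmer.

```perl
use strict;
use warnings;

# Toy suffix stripper for illustration only; NOT the Porter algorithm.
# The suffix list is an assumption chosen to fit the example words.
sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    for my $suffix (qw(ing ers er es ed s)) {
        # Strip the first matching suffix, keeping a stem of at least 3 letters.
        return $1 if $word =~ /^(\w{3,}?)\Q$suffix\E$/;
    }
    return $word;
}

my @words = qw(computer computing computers computed index indexes indexing);

# Compare the number of raw dictionary keys with the number of stemmed keys.
my (%raw, %stemmed);
for my $word (@words) {
    $raw{$word}++;
    push @{ $stemmed{ toy_stem($word) } }, $word;
}

printf "%d distinct words, %d distinct stems\n",
    scalar(keys %raw), scalar(keys %stemmed);
print "$_: @{ $stemmed{$_} }\n" for sort keys %stemmed;
```

Each stem key maps back to the surface forms it absorbed, which is the sense in which the index dictionary shrinks.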

While this type of processing reduces the number of index dictionary keys, I am concerned about the likelihood of stemming errors, whereby dissimilar words are stemmed to the same root, particularly given that indexing speed and space requirements should not be an issue in the application environment. See here for a discussion on over- and under-stemming errors.
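One way to put numbers on this concern is to check a stemmer against hand-labelled groups of words that should share a root. The following is a minimal sketch under that assumption: `stemming_errors` and the concept groups are hypothetical, and the six-character truncation is only a crude stand-in for whichever stemmer is being evaluated.

```perl
use strict;
use warnings;

# Report stemming errors against hand-labelled concept groups.
# $stem_fn is any code ref that maps a word to its stem.
sub stemming_errors {
    my ($stem_fn, %groups) = @_;
    my (%concepts_for_stem, @under);
    for my $concept (sort keys %groups) {
        my %stems;
        for my $word (@{ $groups{$concept} }) {
            my $stem = $stem_fn->($word);
            $stems{$stem} = 1;
            $concepts_for_stem{$stem}{$concept} = 1;
        }
        # Under-stemming: one concept ends up with more than one stem.
        push @under, $concept if scalar(keys %stems) > 1;
    }
    # Over-stemming: one stem is shared by more than one concept.
    my @over = grep { scalar(keys %{ $concepts_for_stem{$_} }) > 1 }
               sort keys %concepts_for_stem;
    return { over => \@over, under => \@under };
}

# Crude stand-in stemmer (an assumption for the demo): keep six characters.
my $truncate = sub { substr(lc($_[0]), 0, 6) };

my $report = stemming_errors(
    $truncate,
    maintain   => [qw(maintained maintenance)],
    experience => [qw(experience experienced)],
    experiment => [qw(experiment experiments)],
);
print "over-stemmed stems:     @{ $report->{over} }\n";
print "under-stemmed concepts: @{ $report->{under} }\n";
```

Any real stemmer can be passed in place of `$truncate`, provided it is wrapped in a code ref that takes one word and returns one stem.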

And so I ask a barrage of questions:

What are the experiences of fellow monks with natural language stemming?

Have other monks found better results, as measured by minimal stemming errors, via one stemming algorithm (for example, Paice-Husk, Porter, etc.) over another?

And in particular, what are other monks' experiences with the Porter stemming algorithm as implemented in Lingua::Stem?

My thanks in advance


Replies are listed 'Best First'.
Re: Natural Language Index Stemming by cjf (Parson) on Jun 18, 2002 at 04:58 UTC
As for Lingua::Stem, I just tried out a few examples from the Stemming Performance page that you linked to: `use strict; use Lingua::Stem; my $stemmer = Lingua::Stem->new(); my @words = qw/maintained maintenance environment experience/; my $stems = $stemmer->stem(@words); print "$_ " for (@$stems);` The output was: `maintain mainten environ experi` So it appears to have failed to merge maintain with maintenance(?), but correctly dealt with the environment/experience difference described on that page. This is the first time I've looked into the subject, so I could be a fair bit off the mark :). As for other (sort of) related modules, I've found TheDamian's Lingua::EN::Inflect to be useful (and fun) to use on occasion. I'm not sure how much that applies to your question though. ++ for an interesting thread, I look forward to hearing what your conclusions are. Edited 18 June 2002 (footpad): Fixed broken </code> tag.
Re: Natural Language Index Stemming by samtregar (Abbot) on Jun 18, 2002 at 01:56 UTC
I built a search engine in Perl that used Glimpse as the backend searcher. Glimpse supports several varieties of stemming, which were available as options in my system. It seemed to work as advertised. -sam
Re: Natural Language Index Stemming by perrin (Chancellor) on Jun 18, 2002 at 01:25 UTC
The Porter algorithm worked well enough for us when building the search engine for etoys.com. I haven't tried any others. The implementation we used was actually in C, though.
Re: Natural Language Index Stemming by toma (Vicar) on Jun 18, 2002 at 06:23 UTC
I used Lingua::Stem when I made concordances of some Shakespeare and Melville texts that I downloaded from Project Gutenberg. I found that the stemming was quite conservative for my purposes, erring on the side of avoiding collisions. My more challenging problem was the proper choice of stoplist words, which would not be indexed at all. I will someday integrate stemming into my Style and Spelling Checker, I hope. It should work perfectly the first time! - toma
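For readers unfamiliar with stoplists, here is a minimal sketch of a concordance that skips stoplist words entirely. The `%stop` list and the `concordance` function are hypothetical; they are not toma's code, and picking a stoplist that suits a given corpus is exactly the hard part described above.

```perl
use strict;
use warnings;

# Hypothetical stoplist; a real one would be tuned to the corpus.
my %stop = map { $_ => 1 } qw(a an and of the to in is it that);

# Map each non-stoplist word to the line numbers where it occurs
# (one entry per occurrence, so repeats on a line appear more than once).
sub concordance {
    my (@lines) = @_;
    my %index;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        for my $word (map { lc($_) } $line =~ /([A-Za-z']+)/g) {
            next if $stop{$word};    # stoplist words are never indexed
            push @{ $index{$word} }, $line_no;
        }
    }
    return \%index;
}

my $index = concordance(
    'Call me Ishmael.',
    'To be, or not to be, that is the question.',
);
print "$_: @{ $index->{$_} }\n" for sort keys %$index;
```

A stemmer would slot in just before the stoplist check or just after it, depending on whether the stoplist holds surface forms or stems.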
Re: Natural Language Index Stemming by simon.proctor (Vicar) on Jun 18, 2002 at 07:44 UTC
I used Paice Husk stemming for my search engine and used MLDBM and Storable for creating the index. I also used a second index to cache the HTML meta data. I quite liked Paice Husk as it translated to Perl very easily. I just had to keep the rules in an array and reverse all fragments of my search terms. If you want an alternative to Lingua::Stem then I seriously recommend it. You can find the paper here. They also give an (old) Perl example which should help provide a basis for your app if you choose to try it.
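The "rules in an array, endings reversed" idea might look something like the sketch below. This is a simplification under stated assumptions: the four rules are made up for illustration and are not the published Paice/Husk rule table, the minimum stem length of three letters stands in for the real acceptability conditions, and `rule_stem` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical rule table, NOT the published Paice/Husk rules.
# Each rule holds the ending spelled backwards, how many letters to remove,
# what to append, and whether to try the rules again afterwards.
my @rules = (
    { ending => 'gni', remove => 3, append => '',  repeat => 1 },  # -ing
    { ending => 'sei', remove => 3, append => 'y', repeat => 0 },  # -ies to -y
    { ending => 'de',  remove => 2, append => '',  repeat => 1 },  # -ed
    { ending => 's',   remove => 1, append => '',  repeat => 1 },  # -s
);

sub rule_stem {
    my ($word) = @_;
    my $stem  = lc $word;
    my $again = 1;
    while ($again) {
        $again = 0;
        my $reversed = scalar reverse $stem;   # match endings at the front
        for my $rule (@rules) {
            next unless index($reversed, $rule->{ending}) == 0;
            # Simplified acceptability check: keep at least three letters.
            next if length($stem) - $rule->{remove} < 3;
            $stem = substr($stem, 0, length($stem) - $rule->{remove})
                  . $rule->{append};
            $again = $rule->{repeat};
            last;                              # first matching rule wins
        }
    }
    return $stem;
}

print "$_ -> ", rule_stem($_), "\n" for qw(indexing indexed queries terms);
```

Reversing the word turns suffix matching into prefix matching, so each rule check is a plain `index(...) == 0` against the reversed string.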
Re: Natural Language Index Stemming by PetaMem (Priest) on Jun 18, 2002 at 11:23 UTC
Aaah my lovely favourite subfield of interest... first off, you can differentiate between knowledge-based stemming algorithms and probabilistic stemming. And of course there is a bunch of heuristic mixtures of these two approaches spread all over the literature and the web. If you want something "not so good, but good enough and not expensive", you could use the next generation of the old stemmer. See Snowball. Snowball is quite ok, especially because there are descriptions for more languages. However, you will never reach 100% accuracy with this approach, as only a dictionary of a given language together with morphology knowledge will give you the best (but still ambiguous) results. But this requires heavy-duty hardware on which heavy-duty software can run... Bye PetaMem
Re: Natural Language Index Stemming by quinkan (Monk) on Jun 19, 2002 at 06:49 UTC
See www.mds.rmit.edu.au/~msf/papgers/adcs98.pdf for one comparison of various stemming methods. It "sorta" comes down on the side of the Porter method in my mind -- but see what you think.

