rob_au has asked for the wisdom of the Perl Monks concerning the following question:
For those unfamiliar with this concept, stemming is the process of reducing a word to its stem or root form. This allows similar words such as computer and computing to be conflated, or reduced to a single root (for example, comput), thereby reducing index dictionary size and, in theory, storage requirements and processing time. A further discussion of this concept can be found here.
While this type of processing reduces the number of index dictionary keys, I am concerned about the likelihood of stemming errors, whereby dissimilar words are stemmed to the same root, particularly given that indexing speed and space requirements should not be an issue in the application environment. See here for a discussion of over- and under-stemming errors.
And so I ask a barrage of questions:
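To make the two kinds of error concrete, here is a minimal sketch of a toy suffix-stripping stemmer. It is not the Porter or Paice-Husk algorithm and has nothing to do with how Lingua::Stem works internally; the suffix list, `toy_stem` and `group_by_stem` are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical suffix list, checked longest first. Real stemmers apply
# ordered rule sets with extra conditions; this is only a toy.
SUFFIXES = ("ing", "ity", "er", "al", "e", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of words that were conflated onto it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    # Intended conflation: both words share one index key.
    print(group_by_stem(["computer", "computing"]))
    # Over-stemming: words with different meanings share one key.
    print(group_by_stem(["universe", "university", "universal"]))
    # Under-stemming: related words end up under different keys.
    print(group_by_stem(["alumnus", "alumni"]))
```

With this toy rule set, computer and computing both reduce to comput as intended, but universe, university and universal all collapse to univers (over-stemming), while alumnus and alumni stay apart (under-stemming). Grouping a sample of the real corpus vocabulary by stem in this way is one rough way to eyeball how often a given stemmer conflates words it should not.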
- What are the experiences of fellow monks with natural language stemming?
- Have other monks found better results, as measured by minimal stemming errors, via one stemming algorithm (for example, Paice-Husk, Porter, etc.) over another?
- And in particular, what are other monks' experiences with the Porter stemming algorithm as implemented in Lingua::Stem?
My thanks in advance
Replies are listed 'Best First'.
Re: Natural Language Index Stemming
by cjf (Parson) on Jun 18, 2002 at 04:58 UTC
Re: Natural Language Index Stemming
by samtregar (Abbot) on Jun 18, 2002 at 01:56 UTC
Re: Natural Language Index Stemming
by perrin (Chancellor) on Jun 18, 2002 at 01:25 UTC
Re: Natural Language Index Stemming
by toma (Vicar) on Jun 18, 2002 at 06:23 UTC
Re: Natural Language Index Stemming
by simon.proctor (Vicar) on Jun 18, 2002 at 07:44 UTC
Re: Natural Language Index Stemming
by PetaMem (Priest) on Jun 18, 2002 at 11:23 UTC
Re: Natural Language Index Stemming
by quinkan (Monk) on Jun 19, 2002 at 06:49 UTC