Need a Python module for stemming of text documents
All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results, such as:
In [3]: from nltk.stem.porter import *
In [4]: stemmer = PorterStemmer()
In [5]: stemmer.stem('identified')
Out[5]: u'identifi'
In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'
To get the correct root words, one needs a dictionary-based stemmer such as the Hunspell stemmer. A Python implementation of it is available at the following link. Example code:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']
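To make the difference between the two approaches concrete, here is a minimal pure-Python sketch of the dictionary-lookup idea with an algorithmic fallback. It is not how Hunspell works internally; `TOY_LEXICON`, `naive_suffix_strip`, and `dictionary_stem` are hypothetical names made up for illustration, and the lexicon is a hand-written stand-in for real `.dic`/`.aff` data.

```python
# Hypothetical toy lexicon mapping inflected forms to root words.
# A real dictionary-based stemmer would load this from dictionary files.
TOY_LEXICON = {
    "identified": "identify",
    "nonsensical": "nonsense",
    "linked": "link",
}


def naive_suffix_strip(word):
    """Crude algorithmic fallback: strip one common suffix (this is NOT Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dictionary_stem(word, lexicon=TOY_LEXICON):
    """Prefer the lexicon entry; fall back to suffix stripping for unknown words."""
    word = word.lower()
    return lexicon.get(word, naive_suffix_strip(word))


print(naive_suffix_strip("identified"))  # identifi (a non-word, like the Porter result above)
print(dictionary_stem("identified"))     # identify (known to the lexicon)
print(dictionary_stem("walked"))         # walk (unknown word, fallback applies)
```

The trade-off is the usual one: the lookup only returns real words for forms the dictionary covers, so coverage depends entirely on the dictionary you load.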
The Python stemming module has implementations of various stemming algorithms, such as Porter, Porter2, Paice-Husk, and Lovins: http://pypi.python.org/pypi/stemming/1.0
>>> from stemming.porter2 import stem
>>> stem("factionally")
'faction'
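Since the question is about whole text documents rather than single words, here is a rough sketch of applying any per-word stemmer to a document. `stem_document` and `strip_plural_s` are hypothetical helpers, not part of any library; the stand-in stemmer is only there so the snippet runs without third-party packages, and the simple regex tokenizer assumes plain ASCII English text.

```python
import re


def stem_document(text, stem_func):
    """Lowercase the text, split it into alphabetic tokens, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem_func(token) for token in tokens]


def strip_plural_s(word):
    """Stand-in stemmer (assumption for this demo): drop a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


print(stem_document("Stemmers reduce words.", strip_plural_s))
# ['stemmer', 'reduce', 'word']
```

In practice you would pass a real stemmer instead of `strip_plural_s`, for example the `stem` function from `stemming.porter2` shown above or the `stem` method of an NLTK `PorterStemmer` instance.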