What is the best stemming method in Python?
The results you are getting are (generally) expected for a stemmer in English. You say you tried "all the nltk methods" but when I try your examples, that doesn't seem to be the case.
Here are some examples using the PorterStemmer:

>>> import nltk
>>> ps = nltk.stem.PorterStemmer()
>>> ps.stem('grows')
'grow'
>>> ps.stem('leaves')
'leav'
>>> ps.stem('fairly')
'fairli'
The results are 'grow', 'leav' and 'fairli' which, even if they are what you wanted, are stemmed versions of the original word.
If we switch to the Snowball stemmer, we have to provide the language as a parameter.
>>> import nltk
>>> sno = nltk.stem.SnowballStemmer('english')
>>> sno.stem('grows')
'grow'
>>> sno.stem('leaves')
'leav'
>>> sno.stem('fairly')
'fair'
The results are as before for 'grows' and 'leaves', but 'fairly' is stemmed to 'fair'.
So in both cases (and there are more than two stemmers available in nltk), words that you say are not stemmed, in fact, are. The LancasterStemmer will return 'easy' when provided with 'easily' or 'easy' as input.
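To see all three side by side, here is a small sketch that runs the same words through the Porter, Snowball, and Lancaster stemmers (the outputs noted in the comments are the ones quoted above):

```python
# Compare the three nltk stemmers discussed above on the same words.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}

for word in ['grows', 'leaves', 'fairly', 'easily']:
    # e.g. 'fairly' -> 'fairli' (Porter) but 'fair' (Snowball),
    # and 'easily' -> 'easy' (Lancaster)
    print(word, {name: s.stem(word) for name, s in stemmers.items()})
```

The differing outputs for 'fairly' and 'easily' make it easy to judge how aggressive each stemmer is for your data.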
Maybe you really wanted a lemmatizer? That would return 'article' and 'poodle' unchanged.
>>> import nltk
>>> lemma = nltk.stem.WordNetLemmatizer()
>>> lemma.lemmatize('article')
'article'
>>> lemma.lemmatize('leaves')
'leaf'
All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results such as:
In [3]: from nltk.stem.porter import *
In [4]: stemmer = PorterStemmer()
In [5]: stemmer.stem('identified')
Out[5]: u'identifi'
In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'
To correctly get the root words one needs a dictionary-based stemmer such as the Hunspell stemmer. There are Python bindings for it (the hunspell package); example code is below.
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']
Stemmers vary in their aggressiveness. Porter is one of the most aggressive stemmers for English; I find it usually hurts more than it helps. On the lighter side you can either use a lemmatizer instead, as already suggested, or a lighter algorithmic stemmer. The limitation of lemmatizers is that they cannot handle unknown words.
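To illustrate that last point: an algorithmic stemmer happily applies its suffix rules to a word it has never seen, which a dictionary-based lemmatizer cannot do. A sketch using a made-up word:

```python
# Algorithmic stemmers apply suffix rules blindly, so they also
# work on nonce words that no dictionary contains.
from nltk.stem import SnowballStemmer

sno = SnowballStemmer('english')
print(sno.stem('blorfing'))  # the -ing rule still fires -> 'blorf'
```

A lemmatizer, by contrast, would leave 'blorfing' unchanged because it has no dictionary entry to map it to.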
Personally I like the Krovetz stemmer, which is a hybrid solution combining a dictionary lemmatizer with a lightweight stemmer for out-of-vocabulary words. Krovetz is also the kstem or light_stemmer option in Elasticsearch. There is a Python implementation on PyPI (https://pypi.org/project/KrovetzStemmer/), though that is not the one that I have used.
Another option is the lemmatizer in spaCy. After processing with spaCy, every token has a lemma_ attribute. (Note the underscore: the lemma attribute without it holds a numerical identifier of the lemma.) See https://spacy.io/api/token.
Here are some papers comparing various stemming algorithms:
- https://www.semanticscholar.org/paper/A-Comparative-Study-of-Stemming-Algorithms-Ms-.-Jivani/1c0c0fa35d4ff8a2f925eb955e48d655494bd167
- https://www.semanticscholar.org/paper/Stemming-Algorithms%3A-A-Comparative-Study-and-their-Sharma/c3efc7d586e242d6a11d047a25b67ecc0f1cce0c?navId=citing-papers
- https://www.semanticscholar.org/paper/Comparative-Analysis-of-Stemming-Algorithms-for-Web/3e598cda5d076552f4a9f89aaa9d79f237882afd
- https://scholar.google.com/scholar?q=related:MhDEzHAUtZ8J:scholar.google.com/&scioq=comparative+stemmers&hl=en&as_sdt=0,5