How to use the spaCy lemmatizer to get a word into its basic form


Previous answer is convoluted and can't be edited, so here's a more conventional one.

    # Make sure you have downloaded the English model with "python -m spacy download en"
    import spacy

    nlp = spacy.load('en')
    doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
    for token in doc:
        # token.lemma is the integer ID of the lemma, token.lemma_ is its string form
        print(token, token.lemma, token.lemma_)

Output:

    Apples 6617 apples
    and 512 and
    oranges 7024 orange
    are 536 be
    similar 1447 similar
    . 453 .
    Boots 4622 boot
    and 512 and
    hippos 98365 hippo
    are 536 be
    n't 538 not
    . 453 .

From the official Lightning tour
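On spaCy 3 and later the `'en'` shortcut no longer resolves, so the snippet above needs a full model package name. Here is a minimal sketch, assuming the `en_core_web_sm` model has been installed with `python -m spacy download en_core_web_sm`. The helper name `lemmatize` is made up for illustration, and it returns `None` when spaCy or the model is missing:

```python
def lemmatize(text, model="en_core_web_sm"):
    """Return (text, lemma id, lemma string) per token, or None if spaCy/model is unavailable."""
    try:
        import spacy
        # Full package name; the 'en' shortcut is not available in spaCy 3+.
        nlp = spacy.load(model)
    except (ImportError, OSError):
        # ImportError: spaCy is not installed; OSError: the model package was not found.
        return None
    return [(token.text, token.lemma, token.lemma_) for token in nlp(text)]


rows = lemmatize("Apples and oranges are similar. Boots and hippos aren't.")
if rows is None:
    print("spaCy or the en_core_web_sm model is not installed")
else:
    for text, lemma_id, lemma in rows:
        print(text, lemma_id, lemma)
```

The integer IDs printed by newer versions are large hash values, so they will not match the small numbers in the output above. The string in `token.lemma_` is the part you normally want.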


If you want to use just the Lemmatizer, you can do that in the following way:

    from spacy.lemmatizer import Lemmatizer
    from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

    lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
    lemmas = lemmatizer(u'ducks', u'NOUN')
    print(lemmas)

Output:

['duck']
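To make the roles of the three tables more concrete, here is a toy sketch of how an index of known lemmas, an exceptions table, and suffix rules could be combined. It is not spaCy's implementation: the tables `TOY_INDEX`, `TOY_EXC`, `TOY_RULES` and the function `toy_lemmatize` are hypothetical, with tiny hand-made data.

```python
# Hypothetical, hand-made tables keyed by part of speech (not spaCy's data).
TOY_INDEX = {"noun": {"duck", "boot", "hippo", "orange"}}    # known lemmas
TOY_EXC = {"noun": {"geese": ["goose"], "mice": ["mouse"]}}  # irregular forms
TOY_RULES = {"noun": [("ies", "y"), ("es", "e"), ("es", ""), ("s", "")]}  # (suffix, replacement)


def toy_lemmatize(word, pos):
    """Return candidate lemmas for `word` given a coarse POS tag such as 'NOUN'."""
    pos = pos.lower()
    word = word.lower()
    # 1. Irregular forms listed in the exceptions table win outright.
    exceptions = TOY_EXC.get(pos, {})
    if word in exceptions:
        return list(exceptions[word])
    # 2. Try each suffix rule and keep candidates that are known lemmas.
    index = TOY_INDEX.get(pos, set())
    candidates = []
    for suffix, replacement in TOY_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in index and candidate not in candidates:
                candidates.append(candidate)
    # 3. Fall back to the word itself if no rule produced a known lemma.
    return candidates or [word]


print(toy_lemmatize("ducks", "NOUN"))    # ['duck']
print(toy_lemmatize("geese", "NOUN"))    # ['goose']
print(toy_lemmatize("oranges", "NOUN"))  # ['orange']
print(toy_lemmatize("xyzzy", "NOUN"))    # ['xyzzy']
```

The POS tag selects which tables apply, which is why the real lemmatizer call also takes a tag alongside the word.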

Update

Since spaCy version 2.2, LEMMA_INDEX, LEMMA_EXC, and LEMMA_RULES have been bundled into a Lookups object:

    import spacy

    nlp = spacy.load('en')
    nlp.vocab.lookups
    >>> <spacy.lookups.Lookups object at 0x7f89a59ea810>
    nlp.vocab.lookups.tables
    >>> ['lemma_lookup', 'lemma_rules', 'lemma_index', 'lemma_exc']

You can still use the lemmatizer directly with a word and a POS (part of speech) tag:

    from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

    lemmatizer = nlp.vocab.morphology.lemmatizer
    lemmatizer('ducks', NOUN)
    >>> ['duck']

You can pass the POS tag as an imported constant, as above, or as a string:

    lemmatizer('ducks', 'NOUN')
    >>> ['duck']



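Code:

    # Targets an early spaCy release: the spacy.en module was later moved to spacy.lang.en
    import os
    from spacy.en import English, LOCAL_DATA_DIR

    data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
    nlp = English(data_dir=data_dir)
    doc3 = nlp(u"this is spacy lemmatize testing. programming books are more better than others")
    for token in doc3:
        # Python 2 print statement; on Python 3 write print(token, token.lemma, token.lemma_)
        print token, token.lemma, token.lemma_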

Output:

    this 496 this
    is 488 be
    spacy 173779 spacy
    lemmatize 1510965 lemmatize
    testing 2900 testing
    . 419 .
    programming 3408 programming
    books 1011 book
    are 488 be
    more 529 more
    better 615 better
    than 555 than
    others 871 others

Example Ref: here